Cell-based detection and differentiation of lung cancer

ABSTRACT

The present invention provides a method for detecting and differentiating disease states with high sensitivity and specificity. The method allows for a determination of whether a cell-based sample contains abnormal cells and, for certain diseases, is capable of determining the histologic type of disease present. The method detects changes in the level and pattern of expression of the molecular markers in the cell-based sample. Panel selection and validation procedures are also provided.

BACKGROUND OF THE INVENTION

The present invention relates to early detection of a general disease state in a patient. The present invention also relates to discrimination (differentiation) between specific disease states in their early stages.

Early detection of a specific disease state can greatly improve a patient's chance for survival by permitting early diagnosis and early treatment while the disease is still localized and its pathologic effects limited anatomically and physiologically. Two key evaluative measures of any test or disease detection method are its sensitivity (Sensitivity = True Positives/(True Positives + False Negatives)) and specificity (Specificity = True Negatives/(False Positives + True Negatives)), which measure how well the test performs to accurately detect all affected individuals without exception, and without falsely including individuals who do not have the target disease. Historically, many diagnostic tests have been criticized due to poor sensitivity and specificity.
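The two definitions above can be illustrated with a short calculation. The following is only a sketch: the function names and the counts are hypothetical and are not taken from any study cited in this document.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of affected individuals that the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unaffected individuals that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical screening result: 100 diseased and 900 healthy individuals.
tp, fn = 80, 20    # diseased: detected / missed
tn, fp = 810, 90   # healthy: cleared / falsely flagged

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these illustrative counts, 20 of the 100 diseased individuals are false negatives and 90 of the 900 healthy individuals are false positives.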

Sensitivity is a measure of a test's ability to detect correctly the target disease in an individual being tested. A test having poor sensitivity produces a high rate of false negatives, i.e., individuals who have the disease but are falsely identified as being free of that particular disease. The potential danger of a false negative is that the diseased individual will remain undiagnosed and untreated for some period of time, during which the disease may progress to a later stage wherein treatments, if any, may be less effective. An example of a test that has low sensitivity is a protein-based blood test for HIV. This type of test exhibits poor sensitivity because it fails to detect the presence of the virus until the disease is well established and the virus has invaded the bloodstream in substantial numbers. In contrast, an example of a test that has high sensitivity is viral-load detection using the polymerase chain reaction (PCR). High sensitivity is achieved because this type of test can detect very small quantities of the virus (see Lewis, D. R. et al. “Molecular Diagnostics: The Genomic Bridge Between Old and New Medicine: A White Paper on the Diagnostic Technology and Services Industry” Thomas Weisel Partners, Jun. 13, 2001).

Specificity, on the other hand, is a measure of a test's ability to identify accurately patients who are free of the disease state. A test having poor specificity produces a high rate of false positives, i.e., individuals who are falsely identified as having the disease. A drawback of false positives is that they force patients to undergo unnecessary medical procedures or treatments, with their attendant risks and emotional and financial stresses, which could have adverse effects on the patient's health. A feature of diseases which makes it difficult to develop diagnostic tests with high specificity is that disease mechanisms often involve a plurality of genes and proteins. Additionally, certain proteins may be elevated for reasons unrelated to a disease state. An example of a test that has high specificity is a gene-based test that can detect a p53 mutation. A p53 mutation will never be detected unless there are cancer cells present (see Lewis, D. R. et al. “Molecular Diagnostics: The Genomic Bridge Between Old and New Medicine: A White Paper on the Diagnostic Technology and Services Industry” Thomas Weisel Partners, Jun. 13, 2001).

Cellular markers are naturally occurring molecular structures within cells that can be discovered and used to characterize or differentiate cells in health and disease. Their presence can be detected by probes, invented and developed by human beings, which bind to markers enabling the markers to be detected through visualization and/or quantified using imaging systems. Four classes of cell-based marker detection technologies are cytopathology, cytometry, cytogenetics and proteomics, which are identified and described below.

Cytopathology relies upon the visual assessment by human experts of cytomorphological changes within stained whole-cell populations. An example is the cytological screening and cytodiagnosis of Papanicolaou-stained cervical-vaginal specimens by cytotechnologists and cytopathologists, respectively. Unlike cytogenetics, proteomics and cytometry, cytopathology is not a quantitative tool. While it is the state-of-the-art in clinical diagnostic cytology, it is subjective and the diagnostic results are often not highly sensitive or reproducible, especially at early stages of cancer (e.g., ASCUS, LSIL).

Tests that rely on morphological analyses involve observing a sample of a patient's cells under a microscope to identify abnormalities in cell and nuclear shape, size, or staining behavior. When viewed through a microscope, normal mature epithelial cells appear large and well differentiated, with condensed nuclei. Cells characterized by dysplasia, however, may be in a variety of stages of differentiation, with some cells being very immature. Finally, cells characterized by invasive carcinoma often appear undifferentiated, with very little cytoplasm and relatively large nuclei.

A drawback to diagnostic tests that rely on morphological analyses is that cell morphology is a lagging indicator. Since form follows function, often the disease state has already progressed to a critical stage by the time the disease becomes evident by morphological analysis. The initial stages of a disease involve chemical changes at a molecular level. Changes that are detectable by viewing cell features under a microscope are not apparent until later stages of the disease. Therefore, tests that measure chemical changes on a molecular level, referred to as “molecular diagnostic” tests, are more likely to provide early detection than tests that rely on morphological analyses alone.

Cytometry is based upon the flow-microfluorometric instrumental analysis of fluorescently stained cells moving in single file in solution (flow cytometry) or the computer-aided microscope instrumental analysis of stained cells deposited onto glass microscope slides (image cytometry). Flow cytometry applications include leukemia and lymphoma immunophenotyping. Image cytometry applications include DNA ploidy, Malignancy-Associated Changes (MACs) and S-phase analyses. The flow and image cytometry approaches yield quantitative data characterizing the cells in suspension or on a glass microscope slide. Flow and image cytometry can produce good marker detection and differentiation results depending upon the sensitivity and specificity of the cellular stains and flow/image measurement features used.

Malignancy-Associated Changes (MACs) have been qualitatively observed and reported since the early to mid-1900s (OC Gruner: “Study of the changes met with leukocytes in certain cases of malignant disease” in Brit J Surg 3: 506-522, 1916) (H E Nieburgs, F G Zak, D C Allen, H Reisman, T Clardy: “Systemic cellular changes in material from human and animal tissues” in Transactions, 7th Ann Mtg Inter Soc Cytol Council, pp 137-144, 1959). From the mid-1900s through 1975, MACs were documented in independent qualitative histology and cytology studies in buccal mucosa and buccal smears (Nieburgs, Finch, Klawe), duodenum (Nieburgs), liver (Elias, Nieburgs), megakaryocytes (Ramsdahl), cervix (Nieburgs, Howdon), skin (Kwitiken), blood and bone marrow (Nieburgs), monocytes and leukocytes (van Haas, Matison, Clausen), and lung and sputum (Martuzzi and Oppen Toth). Before 1975 these qualitative studies reported MAC-based sensitivities for specific disease detection from 76% to 97% and specificities from 50% to 90%. In 1975 Oppen Toth reported a sensitivity of 76% and specificity of 81% in a qualitative sputum analysis study.

Quantitative observations regarding MAC-based probe analysis began two to three decades ago (H Klawe, J Rowinski: “Malignancy associated changes (MAC) in cells of buccal smears detected by means of objective image analysis” in Acta Cytol 18: 30-33, 1974) (G L Wied, P H Bartels, M Bibbo, J J Sychra: “Cytomorphometric markers for uterine cancer in intermediate cells” in Analyt Quant Cytol 2: 257-263, 1980) (G Burger, U Jutting, K Rodenacker: “Changes in benign population in cases of cervical cancer and its precursors” in Analyt Quant Cytol 3: 261-271, 1981). MACs were documented in independent quantitative histology and cytology studies in buccal mucosa and smears (Klawe, Burger), cervix (Wied, Burger, Bartels, Vooijs, Reinhardt, Rosenthal, Boon, Katzke, Haroske, Zahniser), breast (King, Bibbo, Susnik), bladder and prostate (Sherman, Montironi), colon (Bibbo), lung and sputum (Swank, MacAulay, Payne), and nasal mucosa (Reith), with MAC-based sensitivities from 70% to 89% and specificities from 52% to 100%. Marek and Nakhosteen presented (1999, American Thoracic Society annual meeting) the results from two quantitative pulmonary studies showing (a) sensitivity of 89% and specificity of 92%, and (b) sensitivity of 91% and specificity of 100%.

Clearly, Malignancy-Associated Changes (MACs) are potentially useful probes that result from the image-cytometry marker detection technology. MAC-based features from DNA-stained nuclei can be used in conjunction with other molecular diagnostic probes to create optimized molecular diagnostic panels for the detection and differentiation of lung cancer and other disease states.

Cytogenetics detects specific chromosome-based intracellular changes using, for example, in situ hybridization (ISH) technology. ISH technology can be based upon fluorescence (FISH), multi-color fluorescence (M-FISH), or light-absorption-based chromogenics imaging (CHRISH) technologies. The family of ISH technologies uses DNA or RNA probes to detect the presence of the complementary DNA sequence in cloned bacterial or cultured eukaryotic cells. FISH technology can, for example, be used for the detection of genetic abnormalities associated with certain cancers. Examples include probes for Trisomy 8 and HER-2/neu. Other technologies such as the polymerase chain reaction (PCR) can be used to detect B-cell and T-cell gene rearrangements. Cytogenetics is a highly specific marker detection technology since it detects the causative or “trigger” molecular event producing a pathological condition. It may be less sensitive than the other marker detection technologies because fewer events may be present to detect. In situ hybridization (ISH) is a molecular diagnostic method that uses gene-based analyses to detect abnormalities on the genetic level such as mutations, chromosome errors or genetic material inserted by a specific pathogen. For example, in situ hybridization may involve measuring the level of a specific mRNA by treating a sample of a patient's cells with labeled primers designed to hybridize to the specific mRNA, washing away unbound primers and measuring the signal of the label. Due to the uniqueness of gene sequences, a test involving the detection of gene sequences will likely have a high specificity, yielding very few false positives. However, because the amount of genetic material in a sample of cells may be very low, only a very weak signal may be obtained. Therefore, in situ hybridization tests that do not employ pre-amplification techniques will likely have a poor sensitivity, yielding many false negatives.

Proteomics depends upon cell characterization and differentiation resulting from the over-expression, under-expression, or presence/absence of unique or specific proteins in populations of normal or abnormal cell types. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, chemical activities, and cellular/extracellular functions. Immunochemistry (immunocytochemistry in cells and immunohistochemistry (IHC) in tissues) is the technology used, either qualitatively or quantitatively (QIHC), to stain antigens (i.e., proteomes) using antibodies. Immunostaining procedures use a dye as the detection indicator. Examples of IHC applications include analyses for ER (estrogen receptor), PR (progesterone receptor), p53 tumor suppressor genes, and EGFR prognostic markers. Proteomics is typically a more sensitive marker detection technology than cytogenetics because there are often orders of magnitude more protein molecules to detect using proteomics than there are cytogenetic mutations or gene-sequence alterations to detect using cytogenetics. However, proteomics may have a poorer specificity than the cytogenetic marker detection technology since multiple pathologies may result in similar changes in protein over-expression or under-expression. Immunochemistry involves histological or cytological localization of immunoreactive substances in tissue sections or cell preparations, respectively, often utilizing labeled antibodies as probe reagents. Immunochemistry can be used to measure the concentration of a disease marker (specific protein) in a sample of cells by treating the cells with an agent such as a labeled antibody (probe) that is specific for an epitope on the disease marker, then washing away unbound antibodies and measuring the signal of the label. Immunochemistry is based on the property that cancer cells possess different levels of certain disease markers than do healthy cells. The concentration of a disease marker in a cancer cell is generally large enough to produce a large signal. Therefore, tests that rely on immunochemistry will likely have a high sensitivity, yielding few false negatives. However, because other factors in addition to the disease state may cause the concentration of a disease marker to become raised or lowered, tests that rely on immunochemical analysis of a specific disease marker will likely have poor specificity, yielding a high rate of false positives.

The present invention provides a noninvasive disease state detection and discrimination method with both high sensitivity and high specificity. The method involves contacting a cytological sample suspected of containing diseased cells with a panel of probes comprising a plurality of agents, each of which quantitatively binds to a specific disease marker, and detecting and analyzing the pattern of binding of the probe agents. The present invention also provides methods of constructing and validating a panel of probes for detecting a specific disease (or group of diseases) and discriminating among its various disease states. Illustrative panels for detecting lung cancer and discriminating among different types of lung cancer are also provided.

A human disease results from the failure of the human organism's adaptive mechanisms to neutralize external or internal insults which result in abnormal structures or functions within the body's cells, tissues, organs or systems. Diseases can be grouped by shared mechanisms of causation as illustrated below, in Table 1.

TABLE 1

Classes of Diseases                         Examples of Disease States
Allergy                                     Adverse reactions to foods and plants
Cardiovascular                              Heart failure, atherosclerosis
Degenerative (neurological and muscular)    Alzheimer's and Parkinson's
Diet                                        Non-nutritional substances and excess/imbalanced nutrition
Hereditary                                  Sickle cell anemia, cystic fibrosis
Immune                                      HIV and autoimmune
Infection                                   Viral, bacterial, fungal, parasitic
Metabolic                                   Diabetes
Molecular and cell biology                  Cancer (neoplasia)
Toxic insults                               Alcohol, drugs, environmental mutagens and carcinogens
Trauma                                      Bodily injury from automobile collision

Disease states are either caused by or result in abnormal changes (i.e., pathological conditions) at a subcellular, cellular, tissue, organ, or human anatomic or physiological system level. Many disease states (e.g., lung cancer) are characterized by abnormal changes at a subcellular or cellular level. Specimens (e.g., cervical PAP smears, voided urine, blood, sputum, colonic washings) can be collected from patients with suspected disease states to diagnose those patients for the presence and type of the disease state. Molecular pathology is the discipline that attempts to identify and diagnostically exploit the molecular changes associated with these cell-based diseases.

Lung cancer is an illustrative example of a disease state in which screening of high-risk populations and at-risk individuals can be performed using diagnostic tests (e.g., molecular diagnostic panel assays) to detect the presence of the disease state. Also, for patients in which lung cancer or other disease states have been detected by these means, related diagnostic tests can be employed to differentiate the specific disease state from related or co-occurring disease states. For example, in this lung cancer illustration, additional molecular diagnostic panel assays may indicate the probabilities that the patient's disease state is consistent with one of the following types of lung cancer: (a) squamous cell carcinoma of the lung, (b) adenocarcinoma of the lung, (c) large cell carcinoma of the lung, (d) small cell carcinoma of the lung, or (e) mesothelioma. Early detection and differentiation of cell-based disease states is a hypothesized means to improve patient outcomes.

Cancer is a neoplastic disease the natural course of which is fatal. Cancer cells, unlike benign tumor cells, exhibit the properties of invasion and metastasis and are highly anaplastic. Cancer includes the two broad categories of carcinoma and sarcoma, but in normal usage it is often used synonymously with carcinoma. According to the World Health Organization (WHO), cancer affects more than 10 million people each year and is responsible for in excess of 6.2 million deaths.

Cancer is, in reality, a heterogeneous collection of diseases that can occur in virtually any part of the body. As a result, different treatments are not equally effective in all cancers or even among the stages of a specific type of cancer. Advances in diagnostics (e.g., mammography, cervical cytology, and serum PSA testing) have, in some cases, allowed for the detection of early-stage cancer when there are a greater number of treatment options, and therapies tend to be more effective. In cases where a solid tumor is small and localized, surgery alone may be sufficient to produce a cure. However, in cases where the tumor has spread, surgery may provide, at best, only limited benefits. In such cases the addition of chemotherapy and/or radiation therapy may be used to treat metastatic disease. While somewhat effective in prolonging life, treatment of patients with metastatic disease rarely produces a cure. Even though there may be an initial response, with time the disease progresses and the patient ultimately dies from its effects and/or from the toxic effects of the treatments.

While not proven, it is generally accepted that early detection and treatment will reduce the morbidity, mortality and cost of cancer. Early detection will, in many cases, permit treatment to be initiated prior to metastasis. Furthermore, because there are a greater number of treatment options, there is a higher probability of achieving a cure or significant improvement in long-term survival.

Developing a test that can be used to screen an “at-risk” population has long been a goal of health practitioners. While there have been some successes such as mammography for breast cancer, PSA testing for prostate cancer, and the PAP smear for cervical cancer, in most cases cancer is detected at a relatively late stage where the patient is symptomatic and the disease is almost always fatal. For most cancers, no test or combination of tests has exhibited the necessary sensitivity and specificity to permit cost-effective identification of patients with early stage disease.

For a cancer screening program to be successful and gain acceptance by patients, physicians, and third party payers, the test must have implied benefit (changes the outcome), be widely available and be able to be carried out readily within the framework of general healthcare. The test should be relatively noninvasive, leading to adequate compliance, have high sensitivity, and reasonable specificity and predictive value. In addition, the test must be available at relatively low cost.

For patients who are suspected of having cancer, the diagnosis must be confirmed and the tumor properly staged cytologically and clinically in order for physicians to undertake appropriate therapeutic intervention. Some tests currently being used in the diagnosis and staging of cancer, however, either lack sufficient sensitivity or specificity, are too invasive, or are too costly to justify their use as a population-based screening test. Shown below in Tables 2 and 3, for example, are estimates of sensitivity and specificity of lung cancer diagnostics and estimated costs for diagnostic tests used to detect lung cancer.

TABLE 2
ESTIMATES OF SENSITIVITY AND SPECIFICITY OF LUNG CANCER DIAGNOSTICS [1]

DIAGNOSTIC TEST                 SENSITIVITY (%)    SPECIFICITY (%)
Conventional Sputum Cytology    51.0               100.0
Chest X-ray                     16-85*             90-95
White Light Bronchoscopy        48.0-80.0          91.1-96.8
LIFE Bronchoscopy               72.0               86.7
Computed Tomography             63.0-99.9          80.0-61
PET Scan                        88.0-92.5          83.0-93.0

*Dependent upon the stage of the disease at the time of diagnosis

TABLE 3
ESTIMATED COSTS FOR DIAGNOSTIC TESTS USED IN LUNG CANCER [1]

DIAGNOSTIC TEST        COST ($)
Sputum Cytology        90
Chest X-ray            44
Bronchoscopy           725
Computed Tomography    378
PET Scan               800-3000
Open Biopsy            12,847-14,121

The chest radiograph (X-ray) is often used to detect and localize cancer lesions due to its reasonable sensitivity, high specificity and low cost. However, small lesions are often difficult to detect and although larger tumors are relatively easy to visualize on a chest film, at the time of detection most have already metastasized. Thus, chest X-rays lack the necessary sensitivity for use as an early detection method.

Computed tomography (CT) is useful in the confirmation and characterization of pulmonary nodules and allows the detection of subtle abnormalities that are often missed on a standard chest X-ray [2]. CT, and Spiral CT methods in particular, remains the test of choice for patients who present with a prior malignant sputum cytology result or vocal cord paralysis. CT, with its improved sensitivity over the conventional chest film, has become the primary tool for imaging the central airway [3]. While capable of examining large areas, CT is subject to artifacts from cardiac and respiratory motion, although improved resolution can be achieved through the use of iodinated contrast material.

Spiral CT is a more rapid and sensitive form of CT that has the potential to detect early cancer lesions more reliably than either conventional CT or X-ray. Spiral CT appears to have greatly improved sensitivity in diagnosing early disease. However, the test has relatively low specificity with a 20% false positive rate [4]. Spiral CT is also less sensitive in detecting the central lesions that represent one-third of all lung cancers. Furthermore, while the cost of the initial test is relatively low ($300), the cost of follow-up can be high. Cytology using molecular diagnostic panel assays offers significant promise as an adjunctive test with Spiral CT to improve the specificity of Spiral CT testing by minimizing false positive results through the evaluation of fine needle aspirations (FNAs) or biopsies (FNBs) from Spiral CT-suspicious pulmonary nodules.
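The effect of an adjunctive confirmation test on specificity can be sketched numerically. The example below assumes the 20% false positive rate corresponds to a specificity of 0.80, assumes the two tests err independently given the true disease status, and uses hypothetical values for the Spiral CT sensitivity and for the panel assay; none of these figures other than the 20% rate comes from the text.

```python
def serial_confirmation(sens_a, spec_a, sens_b, spec_b):
    """Combine test A with a confirmatory test B run only on A-positives.

    A patient is called positive only if both tests are positive.
    Assumes the tests err independently given the true disease status.
    """
    combined_sens = sens_a * sens_b
    combined_spec = 1.0 - (1.0 - spec_a) * (1.0 - spec_b)
    return combined_sens, combined_spec


# spec 0.80 reflects the 20% false positive rate; ct_sens is hypothetical.
ct_sens, ct_spec = 0.95, 0.80
# Hypothetical performance of a panel assay on CT-suspicious nodules.
panel_sens, panel_spec = 0.90, 0.90

sens, spec = serial_confirmation(ct_sens, ct_spec, panel_sens, panel_spec)
print(f"combined sensitivity = {sens:.3f}, combined specificity = {spec:.3f}")
```

Under these assumptions the confirmation step raises specificity at the price of some sensitivity, which is the trade-off any adjunctive test would need to keep small.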

Fluorescence bronchoscopy provides increased sensitivity over conventional white light bronchoscopy, significantly improving the detection of small lesions within the central airway [5]. However, fluorescence bronchoscopy is unable to detect peripheral lesions, it takes a long time for bronchoscopists to examine a patient's airways, and it is an expensive procedure. Additionally, the procedure is moderately invasive, creating an insurmountable barrier to its use as a population-based screening test.

Positron Emission Tomography (PET) is a highly sensitive test that utilizes radioactive glucose to identify the presence of cancer cells within the lung [6-8]. The cost of establishing a testing facility is high and there is the need for a cyclotron on site or nearby. This, coupled with the high cost of the test, has limited the use of PET scans to staging lung cancer patients rather than for early detection of the disease.

Although used for some time as a means of screening for lung cancer, sputum cytology has enjoyed only limited success due to its low sensitivity and its failure to reduce disease-specific mortality. In conventional sputum cytology, the pathologist uses characteristic changes in cellular morphology to identify malignant cells and make a diagnosis of cancer. Today only 15% of patients who are “at-risk” or who are suspected of having lung cancer undergo sputum cytology testing, and less than 5% undergo multiple evaluations [9]. A number of factors including tumor size, location, degree of differentiation, cell clumping, inefficiency of clearing mechanisms to release cells and sputum to the external environment, and the poor stability of cells within the sputum contribute to the overall poor performance of the test.

Cancer diagnostics has traditionally relied upon the detection of single molecular markers. Unfortunately, cancer is a disease state in which single markers have typically failed to detect or differentiate many forms of the disease. Thus, probes that recognize only a single marker have been shown to be largely ineffective. Exhaustive searches for “magic bullet” diagnostic tests have been underway for many decades, though no universally successful magic bullet probes have been found to date.

A major premise of this invention is that cell-based cancer diagnostics and the screening, diagnosis, and therapeutic monitoring of other disease states will be significantly improved over the state of the art by using kits of multiple, simultaneously labeled probes rather than single marker/probe analyses. This multiplexed analytical approach is particularly well suited for cancer diagnostics since cancer is not a single disease. Furthermore, this multi-factorial “panel” approach is consistent with the heterogeneous nature of cancer, both cytologically and clinically.

Key to the successful implementation of a panel approach to cell-based diagnostic tests is the design and development of optimized panels of probes that can chemically recognize the pattern of markers that characterizes and distinguishes a variety of disease states. This patent application describes an efficient and unique methodology to design and develop such novel and optimized panels.

Improved methods for specimen collection (e.g., point-of-care mixers for sputum cytology) and preparation (e.g., new cytology preservation and transportation fluids, and liquid-based cytology preparation instruments) are under development and becoming commercially available. In conjunction with existing and these emerging methods, a successful implementation of this molecular diagnostics cell-based panel assay will lead to (a) characterization of the molecular profile of malignant tumors and other disease states, (b) improved methods for early cancer and other disease state detection and differentiation, and (c) opportunities for improved clinical diagnoses, prognoses, customized patient treatments, and therapeutic monitoring.

SUMMARY OF THE INVENTION

The present invention is directed to a panel for detecting a generic disease state or discriminating between specific disease states using cell-based diagnosis. The panel comprises a plurality of probes each of which specifically binds to a marker associated with a generic or specific disease state, wherein the pattern of binding of the component probes of the panel to cells in a cytology specimen is diagnostic of the presence or specific nature of said disease state. The present invention is also directed to a method of forming a panel for detecting a disease state or discriminating between disease states in a patient using cell-based diagnosis. The method involves determining the sensitivity and specificity of binding of probes each of which specifically binds to a member of a library of markers associated with a disease state and selecting a limited plurality of said probes whose pattern of binding is diagnostic for the presence or specific nature of said disease state. The present invention is also directed to a method of detecting a disease or discriminating between disease states. The method involves contacting a cytological sample suspected of containing abnormal cells characteristic of a disease state with a panel according to claim 1 and detecting a pattern of binding of said probes that is diagnostic for the presence or specific nature of said disease state.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1. Molecular markers that are preferable markers to be included in a panel for identifying different histologic types of lung cancer. The column labeled “%” indicates the percentage of tumor specimens that express a particular marker.

FIG. 2. Potential ways in which different markers may be used to discriminate between specific types of lung cancer. SQ indicates squamous cell carcinoma, AD indicates adenocarcinoma, LC indicates large cell carcinoma, SC indicates small cell carcinoma and ME indicates mesothelioma. The numbers appearing in each cell represent frequency of marker change in one cell type versus another. To be included in the table, the ratio must be greater than 2.0 or less than 0.5. A number larger than 100 generally indicates that the second marker is not expressed. In such cases, the denominator was set at 0.1 for the purpose of the analysis. Finally, empty cells represent either no difference in expression or the absence of expression data.
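The inclusion rule described for FIG. 2 can be sketched as follows. The sketch assumes the ratio compares the percentage of specimens expressing a marker in a first cell type against a second cell type, and that a zero frequency in the second type triggers the 0.1 denominator; the marker names and frequencies are hypothetical and do not come from the figure.

```python
def expression_ratio(freq_first: float, freq_second: float, floor: float = 0.1) -> float:
    """Ratio of marker frequency in the first cell type to the second.

    When the marker is not expressed in the second type (frequency 0),
    the denominator is set to `floor`, mirroring the 0.1 value in the text.
    """
    denominator = freq_second if freq_second > 0 else floor
    return freq_first / denominator


def is_discriminating(ratio: float, high: float = 2.0, low: float = 0.5) -> bool:
    """A ratio enters the table only if it is above 2.0 or below 0.5."""
    return ratio > high or ratio < low


# Hypothetical frequencies (percent of specimens) for one pair of cell types.
pairs = {
    "marker_A": (60.0, 20.0),  # ratio 3.0, included
    "marker_B": (30.0, 25.0),  # ratio 1.2, left as an empty cell
    "marker_C": (45.0, 0.0),   # not expressed in second type, ratio far above 100
}

table = {}
for name, (first, second) in pairs.items():
    ratio = expression_ratio(first, second)
    if is_discriminating(ratio):
        table[name] = ratio

print(table)
```

The floor value explains why entries above 100 appear in the figure: any nonzero frequency divided by 0.1 is inflated tenfold relative to its percentage.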

FIG. 3. Comparisons between H-scores for probes 7 and 15 in control tissue and in cancerous tissue. The x-axis shows the H-scores while the y-axis shows the percent of cases.

FIG. 4. Correlation matrix, in which correlation measures the amount of linear association between a pair of variables. All markers in this matrix with a correlation number of 50% or higher are considered correlated markers.
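A minimal sketch of how correlated marker pairs might be flagged is given below. It assumes the correlation number is the Pearson coefficient and that the 50% cutoff applies to the signed value; the probe names and scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: a measure of linear association between two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-specimen scores for three probes (five specimens each).
scores = {
    "probe_1": [10, 50, 120, 200, 260],
    "probe_2": [20, 60, 110, 210, 250],
    "probe_3": [150, 30, 220, 90, 140],
}

THRESHOLD = 0.5  # the 50% cutoff from the figure description

correlated = [
    (a, b)
    for a, b in combinations(scores, 2)
    if pearson(scores[a], scores[b]) >= THRESHOLD
]
print(correlated)
```

Pairs flagged this way carry overlapping information, so a panel would usually need only one member of each pair.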

FIG. 5. Detection panel compositions, pair-wise discrimination panel compositions and joint discrimination panel compositions. Panel compositions using decision tree analysis, stepwise LR and stepwise LD are shown.

FIG. 6. Detection panel compositions wherein probe 7 was not included as a probe. Panel compositions using decision tree analysis, stepwise LR and stepwise LD are shown.

FIG. 7. Detection panel compositions using only commercially preferred probes. Panel compositions using decision tree analysis, stepwise LR and stepwise LD are shown.

DETAILED DESCRIPTION OF THE INVENTION

1. Introduction

The present invention provides a noninvasive disease state detection and discrimination method with high sensitivity and specificity. The method involves contacting a cytological sample suspected of containing diseased cells with a panel comprising a plurality of agents, each of which quantitatively binds to a disease marker, and detecting a pattern of binding of the agents. This pattern includes the localization and density/concentration of binding of the component probes of the panel. The present invention also provides methods of making a panel for detecting a disease and also for discriminating between disease states as well as panels for detecting lung cancer in early stages and discriminating between different types of lung cancer. Panel tests have been used in medicine. For example, panels are used in blood serum analysis. However, because a cytology analysis involves imaging and localization of specific markers within individual cells and tissues, prior to the present invention it was not apparent that the panel approach would be effective for cytology samples. Additionally, it was not apparent which, if any, statistical analyses could be applied to design and develop an optimized cell-based diagnostic panel of probes.

One of the few examples of a cytology-based screening program is the PAP smear, which screens for cervical cancer. For over 50 years this method has been practiced and has greatly contributed to the fact that today, almost no woman who has regular PAP smears dies of cervical cancer. There are drawbacks, however, to the PAP smear screening program. For example, PAP smears are labor intensive and are not universally accessible. The present molecular diagnostic cell-based screening method utilizing probe panels does not suffer from these drawbacks. The method may be fully automated and thereby made less expensive, increasing access to this type of testing.

The present invention provides a method, having both high specificity and high sensitivity, for detecting a disease state and for discriminating between disease states. The invention is applicable to any cell-based disease state, such as cancer and infectious diseases.

The panel is diagnostic of the presence or specific nature of the disease state. The present invention overcomes the limitations and drawbacks of known disease state detection methods by enabling quick, accurate, relatively noninvasive and easy detection and discrimination of diseased cells in a cytological sample while keeping costs low.

A feature of the inventive method for making a panel of the present invention is the rapidity with which the panel may be developed.

There are several benefits to using a panel of agents in a method for detecting a disease state, and for discriminating between types of disease states. One benefit is that a panel of agents has sufficient redundancy to permit detection and characterization of disease states thereby increasing the sensitivity and specificity of the test. Given the heterogeneous nature of many disease states, no single agent is capable of identifying the vast majority of cases.

An additional benefit to using a panel is that use of a panel permits discrimination between the various types of a disease state based on specific patterns (probe localization and density/concentration) of expression. As the various types of a disease may exhibit dramatic differences in their rate of progression, response to therapy, and lethality, knowledge of the specific type can help physicians choose the optimal therapeutic approach.

2. The Panel

The panel of the present invention comprises a plurality of agents, each of which quantitatively binds to a disease marker, wherein the pattern (localization and density/concentration) of binding of the component agents of the panel is diagnostic of the presence or specific nature of a disease state. Therefore, the panel may be a detection panel or a discrimination panel. A detection panel detects whether a generic disease state is present in a sample of cells, while a discrimination panel discriminates among different specific disease states in a sample of cells known to be affected by a disease state which comprises different types of diseases. The difference between a detection panel and a discrimination panel lies in the specific agents that the panels comprise. A detection panel comprises agents having a pattern of binding that is diagnostic of the presence of a disease state, while a discrimination panel comprises agents having a pattern of binding that allows for determining the specific nature (i.e., each type) of the disease state.

A panel, by definition, contains more than one member. There are several reasons why it is beneficial to use a panel of markers rather than just one marker alone to detect a generic disease state or to discriminate among specific disease states. One reason is the unlikely existence of a probe for one single marker that is present in all diseased cells yet not present in healthy cells, and whose behavior can be measured with a high specificity and sensitivity to yield an accurate test result. If such a single probe existed for detection of a particular disease with high sensitivity and specificity, it would already have been utilized for clinical testing. Rather, it is the directed selection of panel tests, each consisting of multiple probes, that together can provide the range of detection capability to ensure clinically adequate testing.

If one nevertheless chooses to construct a panel test comprising one or a very few probes, then the failure of any single marker/probe combination to perform its labeling function for any reason (for example, diminished reactivity of the specimen cells due to biological variability; inherent variability between lots of probe reagents; a weak, outdated or defective processing reagent; improper processing time or conditions for that probe) could result in a catastrophic failure of the test to detect or discriminate the target disease. The inclusion of multiple, and even redundant probes in each panel test greatly enhances the probability that a failure of any one probe will not cause a catastrophic failure of the test.

A probe is any molecular structure or substructure that binds to a disease marker. The term “agent” as used herein, may also refer to a molecular structure or substructure that binds to a disease marker. Molecular probes are homing devices used by biologists and clinicians to detect and locate markers indicative of the specific disease states. For example, antibodies may be produced that bind specifically to a protein previously identified as a marker for small cell lung cancer. This antibody probe can then be used to localize the target protein marker in cells and tissues of patients suspected of having the disease by using appropriate immunochemical protocols and incubations. If the antibody probe binds to its target marker in a stoichiometric (i.e., quantitative) fashion and is labeled with a chromogenic or colored “tag”, then localization and quantitation of the probe and, indirectly, its target marker may be accomplished using an optical microscope and image cytometry technology.

The present invention contemplates detecting changes in molecular marker expression at the DNA, RNA or protein level using any of a number of methods available to an ordinary skilled artisan. Exemplary probes may be a polyclonal or monoclonal antibody or fragment thereof or a nucleic acid sequence that is complementary to the nucleic acid sequence encoding a molecular marker in the panel. A probe may also be a stain, such as a DNA stain. Many of the antibodies used in the present invention are specific to a variety of cell surface or intracellular antigens as marker substances. The antibodies may be synthesized using techniques generally known to those of skill in the art. For example, after the initial raising of antibodies to the marker, the antibodies can be sequenced and subsequently prepared by recombinant techniques. Alternatively, antibodies may be purchased.

In embodiments of the present invention, the probe contains a label. A probe containing a label is often referred to herein as a “labeled probe”. The label may be any substance that can be attached to a probe so that when the probe binds to the marker a signal is emitted or the labeled probe can be detected by a human observer or an analytical instrument. This label may also be referred to as a “tag”. The label may be visualized using reader instrumentation. The term “reader instrumentation” refers to the analytical equipment used to detect a probe. Labels envisioned by the present invention are any labels that emit a signal and allow for identification of a component in a sample. Preferred labels include radioactive, fluorogenic, chromogenic or enzymatic moieties. Therefore, possible methods of detection include, but are not limited to, immunocytochemistry, immunohistochemistry, in situ hybridization, fluorescent in situ hybridization, flow cytometry and image cytometry. The signal generated by the labeled probe is of sufficient intensity to permit detection by a medical practitioner.

A “marker”, “disease marker” or “molecular marker” is any molecular structure or substructure that is correlated with a disease state or pathogen. The term “antigen” may be used interchangeably with “marker”. Broadly defined, a marker is a biological indicator that may be deliberately used by an observer or instrument to reveal, detect, or measure the presence or frequency and/or amount of a specific condition, event or substance. For example, a specific and unique sequence of nucleotide bases may be used as a genetic marker to track patterns of genetic inheritance among individuals and through families. Similarly, molecular markers are specific molecules, such as proteins or protein fragments, whose presence within a cell or tissue indicates a particular disease state. For example, proliferating cancer cells may express novel cell-surface proteins not found on normal cells of the same type, or may over-express specific secretory proteins whose increased or decreased abundance (e.g., overexpression or underexpression, respectively) can serve as markers for a particular disease state.

Suitable markers for cytology panels are substances that are localized in or on the nucleus, cytoplasm or cell membrane. Markers may also be localized in organelles located in any of these locations in the cell. Exemplary markers localized in the nucleus include but are not limited to retinoblastoma gene product (Rb), Cyclin A, nucleoside diphosphate kinase/nm23, telomerase, Ki-67, Cyclin D1, proliferating cell nuclear antigen (PCNA), p120 (proliferation-associated nucleolar antigen) and thyroid transcription factor 1 (TTF-1). Exemplary markers localized in the cytoplasm include but are not limited to VEGF, surfactant apoprotein A (SP-A), nucleoside diphosphate kinase/nm23, melanoma antigen-1 (MAGE-1), Mucin 1, surfactant apoprotein B (SP-B), ER related protein p29 and melanoma antigen-3 (MAGE-3). Exemplary markers localized in the cell membrane include but are not limited to VEGF, thrombomodulin, CD44v6, E-Cadherin, Mucin 1, human epithelial related antigen (HERA), fibroblast growth factor (FGF), hepatocyte growth factor receptor (C-MET), BCL-2, N-Cadherin, epidermal growth factor receptor (EGFR) and glucose transporter-3 (GLUT-3). An example of a marker located in an organelle of the cytoplasm is BCL-2, located (in part) in the mitochondrial membrane. An example of a marker located in an organelle of the nucleus is p120 (proliferation-associated nucleolar antigen), located in the nucleoli.

Preferred are markers where changes in expression: occur early in disease progression, are exhibited by a majority of diseased cells, allow for detection of in excess of 75% of a given disease type, most preferably in excess of 90% of a given disease type and/or allow for the discrimination between the nature of different types of a disease state.

It is noted that the inventive panel may be referred to as a panel of probes or a panel of markers, since the probes bind to the markers. Therefore, the panel may comprise a number of markers or it may comprise a number of probes that bind to specific markers. For the sake of consistency, the present panel is referred to as a panel of probes; however, it could also be referred to as a panel of markers.

Markers can also include features such as malignancy-associated changes (MACs) in the cell nucleus or features related to the patient's family history of cancer. Malignancy-associated changes, or MACs, are typically sub-visual changes that occur in normal-appearing cells located in the vicinity of cancer cells. These exceedingly subtle changes in the cell nucleus may result biologically from changes in the nuclear matrix and the chromatin distribution pattern. They cannot be appreciated even by trained observers through the visual observation of individual cells, but may be determined from statistical analysis of cell populations using highly automated, computerized high-speed image cytometry. Techniques for detection of MACs are well known to those of skill in the art and are described in more detail in: Gruner, O. C. Brit J. Surg. 3 506-522 (1916); Neiburgs, H. E. et al., Transaction, 7th Annual Mtg. Inter. Soc. Cytol. Council 137-144 (1959); Klawe, H. Acta. Cytol. 18 30-33 (1974); Wied, G. L., et al., Analyt. Quant. Cytol. 2 257-263 (1980); and Burger, G., et al., Analyt. Quant. Cytol. 3 261-271 (1981).

The present invention encompasses any marker that is correlated with a disease state. The individual markers themselves are mere tools of the present invention. Therefore, the invention is not limited to specific markers. One way to classify markers is by their functional relationship to other molecules. As used herein, a “functionally related” marker is a component of the same biological process or pathway as the marker in question and would be known by a person of skill in the art to be abnormally expressed together with the marker in question. For example, many markers are associated with a cell proliferation pathway, such as fibroblast growth factor (FGF), vascular endothelial growth factor (VEGF), Cyclin A and Cyclin D1. Other markers are glucose transporters, such as Glut-1 and Glut-3.

A person of ordinary skill in the art is well equipped to determine a functionally related marker and may research various markers or perform experiments in which the functional behavior of a marker is determined. By way of non-limiting example, a marker may be classified as a molecule involved in angiogenesis, a transmembrane glycoprotein, a cell surface glycoprotein, a pulmonary surfactant protein, a nuclear DNA-binding phosphoprotein, a transmembrane Ca²⁺ dependent cell adhesion molecule, a regulatory subunit of the cyclin-dependent kinases (CDK's), a nucleoside diphosphate kinase, a ribonucleoprotein enzyme, a nuclear protein that is expressed in proliferating normal and neoplastic cells, a cofactor for DNA polymerase delta, a gene that is silent in normal tissues yet when it is expressed in malignant neoplasms is recognized by autologous, tumor-directed and specific cytotoxic T cells (CTL's), a glycosylated secretory protein, the gastrointestinal tract or genitourinary tract, a hydrophobic protein of a pulmonary surfactant, a transmembrane glycoprotein, a molecule involved in proliferation, differentiation and angiogenesis, a proto-oncogene, a homeodomain transcription factor, a mitochondrial membrane protein, a molecule found in nucleoli of a rapidly proliferating cell, a glucose transporter, or an estrogen-related heat shock protein.

Classes of biomarkers and probes include, but are not limited to: (a) morphologic biomarkers, including DNA ploidy, MACs and premalignant lesions; (b) genetic biomarkers including DNA adducts, DNA mutations and apoptotic indices; (c) cell cycle biomarkers including cellular proliferation, differentiation, regulatory molecules and apoptosis markers; and (d) molecular and biochemical biomarkers including oncogenes, tumor suppressor genes, tumor antigens, growth factors and receptors, enzymes, proteins, prostaglandin levels and adhesion molecules.

A “disease state” may be any cell-based disease. In some embodiments the disease state is cancer. In other embodiments, the disease state is an infectious disease. The cancer may be any cancer, including, but not limited to epithelial cell-based cancers from the pulmonary, urinary, gastrointestinal, and genital tracts; solid and/or secretory tumor-based cancers, such as sarcomas, breast cancer, cancer of the pancreas, cancer of the liver, cancer of the kidneys, cancer of the thyroid, and cancer of the prostate; and blood-based cancers, such as leukemias and lymphomas. Exemplary cancers which may be detected by the present invention are lung, bladder, gastrointestinal, cervical, breast or prostate cancer. Exemplary infectious diseases which may be detected are cell-based diseases in which the infectious organism is a virus, bacterium, protozoan, parasite, or fungus. The infectious disease, for example, may be HIV, hepatitis, influenza, meningitis, mononucleosis, tuberculosis or a sexually transmitted disease (STD), such as chlamydia, trichomonas, gonorrhea, herpes and syphilis.

As used herein, the term “generic disease state” refers to a disease which comprises several types of specific diseases, such as lung cancer, sexually transmitted diseases and immune-based diseases. Specific disease states are also referred to as histologic types of diseases. For example, the term “lung cancer” comprises several specific diseases, among which are squamous cell carcinoma, adenocarcinoma, large cell carcinoma, small cell lung cancer and mesothelioma. The term “sexually transmitted diseases” comprises several specific diseases, among which are Gonorrhea, Human Papilloma Virus (HPV), herpes and Syphilis. The term “immune-based diseases” comprises several specific diseases, such as systemic lupus erythematosus (Lupus), rheumatoid arthritis and pernicious anemia.

As used herein, the term “high-risk population” refers to a group of individuals who are exposed to disease-causing agents, e.g., carcinogens, either at home or in the workplace (e.g., a “high-risk population” for lung cancer might have a history of smoking, passive smoking and occupational exposure). Individuals in a “high-risk population” may also have a genetic predisposition.

The term “at-risk” refers to individuals who are asymptomatic but, because of a family history or significant exposure, are at a significant risk of developing a disease state (e.g., an individual at risk for lung cancer with a >30 pack-year history of smoking; “pack-year” is a measurement unit computed by multiplying the number of packs smoked per day by the number of years of this exposure).
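The pack-year computation described above can be sketched as follows. This is a minimal illustration only: the function names and the default threshold argument are assumptions introduced here, with the >30 pack-year figure taken from the lung cancer example in the text.

```python
def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs smoked per day multiplied by years of exposure."""
    if packs_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return packs_per_day * years_smoked


def exceeds_pack_year_threshold(packs_per_day: float, years_smoked: float,
                                threshold: float = 30.0) -> bool:
    """Hypothetical helper: True when exposure is strictly above the threshold."""
    return pack_years(packs_per_day, years_smoked) > threshold
```

For example, 1.5 packs per day for 24 years gives 36 pack-years, which exceeds the 30 pack-year example, whereas 1 pack per day for 30 years gives exactly 30 and does not.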

Cancer is a disease in which cells divide without control due to, for example, altered gene expression. In the methods and panels of the present invention, the cancer may be any malignant growth in any organ. For example, the cancer may be lung, bladder, gastrointestinal, cervical, breast or prostate cancer. Each cancer may comprise a collection of diseases or histological types of cancer. The term “histologic type” refers to cancers of different histology. Depending on the cancer there can be one or several histologic types. For example, lung cancer includes, but is not limited to, squamous cell carcinoma, adenocarcinoma, large cell carcinoma, small cell carcinoma and mesothelioma. Knowledge of the histologic type of cancer affecting a patient is very useful because it helps the medical practitioner to localize and characterize the disease and to determine the optimal treatment strategy.

Infectious diseases include cell-based diseases in which the infectious organism is a virus, bacterium, protozoan, parasite or fungus.

Exemplary detection and discrimination panels are panels that detect lung cancer, a general disease state, and panels that discriminate a single lung cancer type, a specific disease state, from all other types of lung cancer and false positives. False positives can include metastatic cancer of a different type, such as metastasized liver, kidney or pancreatic cancer.

3. Methods of Making a Panel

The method of making a panel for detecting a generic disease state or discriminating between specific disease states in a patient involves determining the sensitivity and specificity of binding of probes to a library of markers associated with a generic or specific disease state and selecting a plurality of said probes whose pattern of binding (localization and density/concentration) is diagnostic of the presence or specific nature of the disease state. In some embodiments, optional preliminary pruning and preparation steps are performed. The method of making a panel of the present invention involves analyzing the pattern of binding of probes to markers in known histologic pathology samples, i.e. gold standards. The classifier designed on the gold standard data can then be used to design a classifier for cytometry, especially automated cytometry. Therefore, the set of marker probes selected from the pathology analysis is used to prepare a new training data set taken from a cytology sample, such as sputum, fine needle aspirations, urine, etc. Cells shed from the specified lesions will stain in a similar fashion to the gold standards. The method described here eliminates the experimental error in selecting the best feature set because the integrity of the diagnosis based on gold standard histologic pathology samples is high. Although it is, in principle, possible to use cytology samples to produce a panel, this is less desirable because cytology samples contain debris, there may be deterioration of the cells in a cytology sample, and the pathology diagnosis may be difficult to confirm clinically.

A library of markers is a group of markers. The library can comprise any number of markers. However, in some embodiments the number of markers in the library is limited by technical and/or commercial practicalities, such as specimen size. For example, in some embodiments, each specimen is tested against all of the markers in the panel. Therefore, the number of markers must not be larger than the number of samples into which the specimen may be divided. Another technical practicality is time. Typically, the library contains fewer than 60 markers. Preferably, the library contains fewer than 50 markers. More preferably, the library contains fewer than 40 markers. Most preferably the library contains 10-30 markers. It is preferable that the library of potential panel members contain more than 10 markers so that there is opportunity to optimize the performance of the panel. As used herein, the term “about” means plus or minus 3 markers.

In some embodiments, a library is obtained by consulting sources which contain information about various markers and correlations between the markers and generic/specific disease states. Exemplary sources include experimental results, theoretical or predicted analyses and literature sources, such as journals, books, catalogues and web sites. These various sources may use histology or cytology and may rely on cytogenetics, such as in situ hybridization; proteomics, such as immunohistochemistry; cytometry, such as MACs or DNA ploidy; and/or cytopathology, such as morphology. The markers may be localized anywhere in or on a cell. For example, the markers may be localized in or on the nucleus, the cytoplasm or the cell membrane. The marker may also be localized in an organelle within any of the aforementioned localizations.

In some embodiments, the library may be of an unsuitable size. Therefore, one or more pruning steps may be required prior to initiating the basic method for making a panel. The pruning step may involve one or several successive pruning steps. One pruning step may involve, for example, setting an arbitrary threshold for sensitivity and/or specificity. Therefore, any marker whose experimental or predicted sensitivity and/or specificity falls below the threshold may be removed from the library. Other exemplary pruning steps, which may be performed alone or in sequence with other pruning steps, may rely on detection technology requirements, access constraints and irreproducibility of reported results. With respect to detection technology requirements, it is possible that the machinery required to detect a particular marker is unavailable. With respect to access constraints, it is possible that licensing restrictions make it difficult or impossible to obtain a probe that binds to a particular marker. In some embodiments, a due diligence study is performed on each marker.
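The threshold-based pruning step can be illustrated with a short sketch. The marker names, the reported values and the 0.6 thresholds used in the usage note are invented for illustration and are not taken from the present data; the function name is likewise an assumption.

```python
from typing import Dict, Tuple


def prune_library(library: Dict[str, Tuple[float, float]],
                  min_sensitivity: float = 0.0,
                  min_specificity: float = 0.0) -> Dict[str, Tuple[float, float]]:
    """Keep only markers whose (sensitivity, specificity) meet both thresholds.

    `library` maps a marker name to its experimental or predicted
    (sensitivity, specificity), each expressed as a fraction in [0, 1].
    """
    kept = {}
    for marker, (sensitivity, specificity) in library.items():
        if sensitivity >= min_sensitivity and specificity >= min_specificity:
            kept[marker] = (sensitivity, specificity)
    return kept
```

With a hypothetical library `{"marker_A": (0.82, 0.90), "marker_B": (0.40, 0.95), "marker_C": (0.75, 0.55)}` and both thresholds set to 0.6, only `marker_A` would remain. Leaving one threshold at its default of 0.0 prunes on the other criterion alone, which corresponds to the "and/or" in the text.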

In some embodiments, prior to beginning the basic method for making a panel, it may be necessary to perform preparation steps. Exemplary preparation steps include optimizing the protocols for objective quantitative detection of the markers in the library and collecting histology specimens. Optimization of the protocols for objective quantitative detection of the markers is within the skill of an ordinary artisan. For example, the necessary reagents and supplies must be obtained, such as buffers, reagents, software and equipment. It is possible that the concentration of reagents may need to be adjusted. For example, if non-specific binding is observed, a person of ordinary skill in the art may dilute the concentration of the probe solution.

In some embodiments, the histology specimens are Gold Standards. The term “Gold Standard” is known by a person of ordinary skill in the art to mean that the histology and clinical diagnosis of the specimen is known. The gold standards are often referred to as a “training” data set. The gold standards comprise a set of measurements, or reliable estimates, of all the features that may contribute to the discriminating process. Such features are collected from samples collected from a representative number of patients with known disease states. The standard samples can be cytology samples but this is less desirable for panel selection.

The histology samples may be obtained by any technique known to those of skill in the art, for example biopsy. In some embodiments, it is necessary that the size of the specimen per patient be large enough so that enough tissue sections can be obtained to test each marker in the library.

In some embodiments, specimens are obtained from multiple patients diagnosed with each specific disease state. One specimen per patient may be obtained, or multiple specimens per patient may be obtained. In embodiments in which multiple specimens are obtained from individual patients, the expertise of the surgeon is relied upon to establish that each specimen obtained from a single patient is similar to the other specimens obtained from that patient. Specimens are also obtained from a control group of patients. The control group of patients may be healthy patients or patients that are not suffering from the generic or specific disease state that is being tested.

The first step of the basic method is determining the sensitivity and specificity of binding of probes to a library of markers associated with the desired disease state. In this step, a probe that is specific for each marker in the library is applied to a sample of the patients' specimens. Therefore, in some embodiments, if there are, for example, 30 markers in the library, each patient's specimen will be divided into 30 samples and each sample will be treated with a probe that is specific for one of the 30 markers. The probe contains a label that may be visualized. Therefore, the pattern and level of binding of the probe to the marker can be detected. The pattern and level of binding may be detected either quantitatively, i.e., by an analytical instrument, or qualitatively, by a human, such as a pathologist.
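As an illustration of this first step, the sensitivity and specificity of a single probe can be computed from its positive/negative calls on specimens of known diagnosis, using the definitions given in the Background (Sensitivity = True Positives/(True Positives + False Negatives); Specificity = True Negatives/(False Positives + True Negatives)). The sketch below assumes that each probe result has already been reduced to a binary call; that reduction, and the function name, are assumptions of this example.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(probe_positive: Sequence[bool],
                            disease_present: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of one probe against known diagnoses."""
    if len(probe_positive) != len(disease_present):
        raise ValueError("each specimen needs one call and one diagnosis")
    tp = fn = fp = tn = 0
    for called, diseased in zip(probe_positive, disease_present):
        if diseased and called:
            tp += 1          # diseased specimen correctly labeled
        elif diseased:
            fn += 1          # diseased specimen missed by the probe
        elif called:
            fp += 1          # control specimen falsely labeled
        else:
            tn += 1          # control specimen correctly unlabeled
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one diseased and one control specimen")
    return tp / (tp + fn), tn / (fp + tn)
```

Repeating this calculation for every probe in the library yields the per-marker figures on which the later selection step operates.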

In some embodiments, an objective and/or quantitative scoring method is developed to detect the pattern and level of binding of the probe to the markers. The scoring method may be heuristically designed. Scoring methods are used to objectify a subjective interpretation, for example, by a pathologist. It is within the skill of an ordinary artisan to determine a suitable scoring method. In some embodiments, the scoring method may comprise categorizing features, such as the density of a marker probe stain as: none, weak, moderate, or intense. In another embodiment, these features may be measured with algorithms operating on microscope slide images. An exemplary scoring method is one in which the proportions and density are consolidated into a single “H Score” obtained by grading the intensity as: none=0, weak=1, moderate=2, intense=3, and the percentage of cells as: 0-5%=0, 6-25%=1, 26-50%=2, 51-75%=3, >75%=4, and then multiplying the two grades together. For example, 50% weakly stained plus 50% moderately stained would score 6=(1×2)+(2×2). The “H score” honors the late Kenneth Hirsch, one of the present inventors.
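The H Score arithmetic can be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: products for several stained populations are summed, as inferred from the worked example above, and a non-integer percentage that falls between two stated bins (for example 5.5%) is assigned to the higher bin.

```python
INTENSITY_GRADE = {"none": 0, "weak": 1, "moderate": 2, "intense": 3}


def percentage_grade(percent_of_cells: float) -> int:
    """Grade the percentage of stained cells: 0-5%=0, 6-25%=1, 26-50%=2, 51-75%=3, >75%=4."""
    if not 0 <= percent_of_cells <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percent_of_cells <= 5:
        return 0
    if percent_of_cells <= 25:
        return 1
    if percent_of_cells <= 50:
        return 2
    if percent_of_cells <= 75:
        return 3
    return 4


def h_score(populations) -> int:
    """Sum of (intensity grade x percentage grade) over the stained populations.

    `populations` is an iterable of (intensity label, percent of cells) pairs.
    """
    return sum(INTENSITY_GRADE[intensity] * percentage_grade(percent)
               for intensity, percent in populations)
```

Under these assumptions, `h_score([("weak", 50), ("moderate", 50)])` reproduces the worked example: (1×2)+(2×2)=6.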

An ordinary artisan is capable of addressing issues related to minimizing potential biases related to pathologists and samples. For example, randomizing may be used to minimize the chance of having a systematic error. Blinding may be used to eliminate experimental biases by the people conducting the experiments. For example, in some embodiments, pathologist-to-pathologist variation may be minimized by conducting a double blind study. As used herein, the term “double blind study” refers to a well-established method for avoiding biases, in which the data collection and data analysis are done independently. In other embodiments, sample-to-sample variation is minimized by randomizing the samples. For example, the samples are randomized before the pathologist analyzes them. There is also randomization involved in the experimental protocols. In some embodiments, each sample is analyzed by at least two pathologists. For each patient, a reliable assessment of the binding of the probe to the marker is obtained. In one embodiment, this diagnosis is made by qualified pathologists, using two pathologists per patient, to check for reliability.
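One simple way to combine randomization and blinding is to shuffle the samples and present them to each pathologist under neutral codes, keeping the decoding key separate until scoring is complete. The sketch below is only one possible arrangement; the code format (`S001`, `S002`, ...) and the function name are assumptions introduced for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def blind_and_randomize(sample_ids: Sequence[str],
                        seed: Optional[int] = None) -> Tuple[List[str], Dict[str, str]]:
    """Shuffle samples and assign neutral codes.

    Returns (reading_order, key): `reading_order` lists the blinded codes in
    the order they are shown to the reader, and `key` maps each code back to
    the original sample identifier for later unblinding.
    """
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    key = {f"S{index:03d}": sample_id
           for index, sample_id in enumerate(shuffled, start=1)}
    return list(key), key
```

Calling the function once per pathologist with different seeds would give each reader an independent presentation order.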

A sufficient number of samples should be collected to produce reliable designs and reliable statistical performance estimates. It is within the skill of an ordinary artisan to determine how many samples are sufficient to produce reliable designs and reliable statistical performance estimates. Most standard classifier design packages have methods for determining the reliability of the performance estimates and the sample size should be progressively increased until reliable estimates are achieved. For example, sufficient estimates to produce reliable designs may be achieved with 200 samples collected and 27 different features estimated from each sample.

The second step is selecting a limited plurality of probes. The selecting step may employ statistical analysis and/or pattern recognition techniques. In order to perform the selecting step, the data may be consolidated into a database. In some embodiments, the probes may be numbered to render their method of action as unseen during the analysis of their effectiveness and further minimize biases. Rigorous statistical techniques are used because of the large amount of data that is generated by this method. Any statistical method may be used and an ordinary skilled statistician will be able to identify which and how many methods are appropriate.

Any number of statistical analysis and/or pattern recognition methods may be employed. Since the structure of the data is initially unknown, and since different classifier design methods perform better for different structures, it is preferred to use at least two design methods on the data. In some embodiments, three different methodologies may be used. One of ordinary skill in the art of statistical analysis and/or pattern recognition of data sets would recognize from characteristics of the data set structures that certain statistical methods would be more likely to yield an efficient result than others, where efficient in this case means achieving a certain level of sensitivity and specificity with a desired number of probes. A person of ordinary skill in the art would know that the efficiency of the statistical analysis and/or method is data dependent.

Exemplary statistical analysis and/or pattern recognition methods are described below:

a) A decision tree method, known as C4.5. C4.5 is public domain software available via ftp from http://www.cse.unsw.edu.au/~quinlan/. This is well suited to data that can be best classified by sequentially applying a decision threshold to specific features in turn. This works best with uncorrelated data; it also copes with data with similar means provided the variances differ. The C4.5 package was used to provide the examples shown herein.
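The idea of applying a decision threshold to a single feature can be illustrated with the sketch below, which searches for the threshold on one probe score that best separates two classes by information gain. This is not the C4.5 package: C4.5 builds a full tree, uses a gain-ratio criterion and includes pruning, none of which is reproduced here. The function names and the 0/1 class coding are assumptions of this example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Binary entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_threshold(scores: Sequence[float],
                   labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return (threshold, information gain) for the best split `score <= threshold`."""
    n = len(labels)
    base = entropy(labels)
    values: List[float] = sorted(set(scores))
    best: Tuple[Optional[float], float] = (None, 0.0)
    # Candidate thresholds are midpoints between consecutive distinct scores.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for s, lab in zip(scores, labels) if s <= threshold]
        right = [lab for s, lab in zip(scores, labels) if s > threshold]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

A tree is obtained by applying such a split, then repeating the search within each resulting subset using the remaining features.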

b) Linear Discriminant Analysis. This involves finding weighted combinations of the features that give the best separation of the classes. These methods work well with correlated data, but not in data with similar means and different variances. Several statistical packages were used (SPSS, SAS and R), depending on the performance estimates and graphical outputs required. Fisher's linear discriminant function was used to obtain the classifier that minimized the error rate. A canonical discriminant function was used to compute receiver operating characteristic (ROC) curves showing the trade-off between sensitivity and selectivity as the decision threshold is changed.

c) Logistic Regression. This is a non-linear transformation of the linear regression model: the dependent variable is replaced by the log odds (logit). Linear regression, like discriminant analysis, belongs to a class of statistical methods founded on linear models. Such models are based on linear relationships between the explanatory variables.
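By way of illustration only, the logit transformation used in logistic regression may be written as follows. The symbols are introduced here for illustration and are not taken from the examples: p is the probability that a sample belongs to the disease class, x_1 to x_k are the marker scores, and the coefficients beta are fitted from the data.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i,
\qquad
p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)}}
```

The model is linear in the marker scores on the logit scale, and the second expression recovers the class probability from that linear combination.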

With a sufficient number of samples it is possible, using the above techniques and software packages, to search for combinations of features giving good discrimination between the classes. Other exemplary statistical analysis and/or pattern recognition methods are the linear Discriminant Function Method in SPSS and the Logistic Regression Method in R and SAS. SPSS is the full product name and is available from SPSS, Inc., located at SPSS, Inc. Headquarters, 233 S. Wacker Drive, 11th floor, Chicago, Ill. 60606 (www.spss.com). SAS is the full product name and is available from SAS Institute, Inc., 100 SAS Campus Drive, Cary, N.C. 27513-2414, USA (www.sas.com). R is the full product name and is available as Free Software under the terms of the Free Software Foundation's GNU General Public License (http://www.r-project.org/).

In some embodiments, a correlation matrix is obtained. Correlation measures the amount of linear association between a pair of variables. A correlation matrix is obtained by correlating the data obtained with one marker to data obtained with another marker. A threshold correlation number may be set, for example, 50% correlation. In this case, all markers with a correlation number of 50% or higher would be considered correlate markers.
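A minimal sketch of the correlate-marker screen described above is given here, assuming that "correlation number" means the Pearson correlation coefficient and that only positive correlation at or above the threshold counts. The marker names and scores are hypothetical and are not taken from the examples.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length score lists (non-zero variance assumed)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlate_markers(scores, threshold=0.5):
    """Return marker pairs whose correlation is at or above the threshold (50% by default)."""
    names = sorted(scores)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(scores[first], scores[second])
            if r >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs

# Hypothetical staining scores for three markers across five specimens.
scores = {
    "marker_A": [1, 2, 3, 4, 5],
    "marker_B": [2, 4, 6, 8, 10],
    "marker_C": [3, 1, 3, 1, 3],
}
pairs = correlate_markers(scores)
```

In this hypothetical data set only marker_A and marker_B are flagged as correlate markers, so either could be a candidate substitute for the other.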

In some embodiments of the present invention, user-supplied weighting factors may be used to obtain optimized panels. Weighting may be related to any factor. For example, certain markers may be weighted higher than others due to cost, commercial considerations, misclassifications or error rates, prevalence of a generic disease state in a geographic location, prevalence of a specific disease state in a geographic location, redundancy and availability of probes. Some factors related to cost that may encourage a user to weight certain markers higher than others are the cost of the probe and commercial access issues, such as license terms and conditions. Some factors related to commercial considerations that may encourage a user to weight certain markers higher than others are Research and Development (R&D) time, R&D cost, R&D risk (i.e., the probability that the probe will work), cost of the final analytical instrument, final performance and the time to market. In a detection panel, for example, a factor related to misclassifications or error rates that may encourage a user to weight some markers higher than others is that it may be desirable to minimize false negatives. In a discrimination panel, on the other hand, it may be desirable to minimize false positives. A factor related to prevalence of a generic or specific disease state in a geographic area that may encourage a user to weight some probes higher than others is that in some geographic locations certain generic or specific diseases are more or less prevalent. With respect to redundancies, in some instances it is desirable to have redundancies in the panel. For example, if for some reason one probe fails to be detected, due to the biological variability of the markers in the panel, a disease state will still be detected by the other markers. In some embodiments, markers that are preferred redundant markers may be weighted more heavily.

The invention is flexible in being adaptable to the availability of features where cost or supply problems may not allow the very best combination. In one embodiment, the invention can simply be applied to the available features to find an alternative combination. In another embodiment, the algorithm is used to select features that allow cost weightings to be included in the selection process to arrive at a minimum cost solution. In the examples, marker performance estimates for combinations selected from all the markers collected or for only a group of commercially preferred probes are shown. The examples also demonstrate how the C4.5 package can be used to down-weight certain probes on the basis of their high cost. These probe combinations may not perform as well as the optimum combination, but the performance might be acceptable in circumstances where cost is a significant factor.

Some of the methods used allow weightings to be applied to the classes. This is available in C4.5, where the tree design can optimize the cost. Also, the Discriminant Function method gives a single parameter output which can be used to give a desired false positive or false negative probability. A plot of these parameters for different threshold settings is known as the receiver operating characteristic (ROC) curve. An ROC curve shows the estimated percentage of false positives plotted against the percentage of true positives for different threshold levels of a classifier.
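The construction of an ROC curve from a single-parameter classifier output can be sketched as follows. This is an illustration only: the function name, scores and labels are hypothetical, and a sample is assumed to be called positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for each candidate threshold.

    scores: classifier outputs, higher meaning more likely diseased.
    labels: 1 for a known diseased sample, 0 for a known non-diseased sample.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep the threshold from the highest observed score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical discriminant scores for five samples with known diagnoses.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
```

Each point corresponds to one threshold setting, so a threshold may be chosen by reading off the point with the desired false positive or false negative rate.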

Given the heterogeneous nature of many generic disease states, the panels may be constructed with a degree of redundancy to ensure that the tests have sufficient sensitivity, specificity, positive predictive value (Positive Predictive Value = True Positives/(True Positives + False Positives)) and negative predictive value (Negative Predictive Value = True Negatives/(False Negatives + True Negatives)) to justify their use as a population-based screen. However, local and regional differences may dictate specific use of the tests in different segments of the global market, and so may significantly influence the criteria used to construct the final panel test for a given market. While the optimization of clinical utility is of utmost importance, local factors including affordability (cost), technical competence, laboratory and healthcare provider resources, workflow issues, manpower requirements, and availability of the probes and labels will contribute to a final, local selection of the markers used in the panel. Well-known linear discriminant function analysis is used to include and assess all potential selection factors, whereby each local factor is represented by a term in the equation, and each is weighted according to its locally determined significance. In this way, a panel test optimized for use in one world region may differ from a panel test optimized for use in a different region.
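The four performance measures defined in this specification can be computed from the counts of true and false results as sketched below. The function name and the counts are hypothetical and serve only to illustrate the formulas.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the four measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # True Positives/(True Positives + False Negatives)
        "specificity": tn / (tn + fp),  # True Negatives/(False Positives + True Negatives)
        "ppv": tp / (tp + fp),          # True Positives/(True Positives + False Positives)
        "npv": tn / (tn + fn),          # True Negatives/(False Negatives + True Negatives)
    }

# Hypothetical screen of 1,100 individuals, 100 of whom have the disease.
metrics = screening_metrics(tp=90, fp=50, tn=950, fn=10)
```

In this hypothetical screen the sensitivity is 90% and the specificity 95%, yet the positive predictive value is only about 64% because the disease is uncommon in the screened group.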

Once detection or discrimination panels have been designed using the above-described method, the next step is to validate the panel using known cytology samples. Prior to validation, optional optimization steps may be performed. In some embodiments, the method for collecting cytology samples may be improved. This encompasses methods of obtaining the sample from the patient as well as methods for mixing the cytology sample. In other embodiments, the cytology presentation methods may be improved, for example, by identifying optimal fixatives (preservation fluids) or transportation fluids.

The cytology samples used to validate the panels produced using the gold standard histology samples are cytology samples with known diagnoses. These samples may be collected using any method known by those of skill in the art. For example, sputum samples can be collected by spontaneous production, induced production and through the use of agents that enhance sputum production. The sample is contacted with each probe in the panel and the level and pattern of binding of the probes is analyzed to determine the performance of the panel. In some embodiments, it may be necessary to further optimize the panel. For example, it may be necessary to remove a probe from the panel. Or, it may be necessary to add an additional probe to the panel. Additionally, it may be necessary to replace one probe on the panel with another probe. If a new probe is added, this probe may be a correlate marker as determined from a correlation matrix. Alternatively, the probe may be a functionally similar marker. Once the panel is optimized, the panel may proceed for further testing in clinical studies.

In other embodiments, it is not necessary to optimize the panel. If the results with the cytology samples correlate with the results from the histology samples, there may not be a need to optimize the panel and the panel may proceed for further testing in clinical studies.

4. Methods of Use

Once a panel is obtained using the above-described method, it may be applied to cytologic samples. To illustrate the method, cancer, especially lung cancer, will be exemplified. Similar steps and procedures will be applied for other disease states. It is to be expected that cells shed from the specified lesions will stain and appear in a cytologic sample, such as a fine needle aspirate, sputum or urine, in a fashion similar to that seen in the histologic pathology samples used to obtain the panel.

The basic method of the present invention typically involves two steps. First, a cytological sample suspected of containing diseased cells is contacted with a panel containing a plurality of agents, each of which quantitatively binds to a disease marker. Then, the level or pattern of binding of each agent to a disease marker is detected. The results of the detection may be used to diagnose the presence of a generic disease or to discriminate among specific disease states. An optional preliminary step is identifying an optimized panel of agents that will aid in the detection of a disease or the discrimination between disease states in a cytologic sample.

Cytology specimens may include, but are not limited to, cellular samples collected from body fluids, such as blood, urine, spinal fluids, and lymphatic systems; epithelial cell-based organ systems, such as the pulmonary tract, e.g., lung sputum, urinary tract, e.g., bladder washings, genital tract, e.g., cervical PAP smears, and gastrointestinal tract, e.g., colonic washings; and fine needle aspirations from solid tissue sites in organs and systems such as the breast, pancreas, liver, kidneys, thyroid, bone marrow, muscles, prostate, and lungs; biopsies from solid tissue sites in organs and systems such as the breast, pancreas, liver, kidneys, thyroid, bone marrow, muscles, prostate, and lungs; and histology specimens, such as tissue from surgical biopsies.

An illustrative panel of agents according to the present invention includes any number of agents that allows for accurate detection of malignant cells in a cytological sample. Molecular markers envisioned by the present invention may be any molecule that aids in the detection of malignant cells. Markers may be selected for inclusion in a panel based on several different criteria relating to changes in level or pattern of expression of the marker. Preferred are molecular markers where changes in expression: occur early in tumor progression, are exhibited by a majority of tumor cells, allow for detection of in excess of 75% of a given tumor type, most preferably in excess of 90% of a given tumor type, and/or allow for the discrimination between histologic types of cancer.

The first step of the basic method is the detection of changes in the level or pattern of expression of the panel of agents in a cytological sample. This step typically involves contacting the cytologic sample with an agent, such as a labeled polyclonal or monoclonal antibody or fragment thereof or a nucleic acid probe, and observing the signal in individual cells. Detection of cells where there is a change in signal is indicative of a change in the level of expression of the molecular marker to which the labeled probe is directed. The changes are based on an increase or decrease in the level of expression relative to nonmalignant cells obtained from the tissue or site being examined.

An analysis of the changes in the level or pattern of expression of a panel of agents enables a skilled artisan to determine, with high sensitivity and high specificity, whether malignant cells are present in the cytologic sample. The term “sensitivity” refers to the conditional probability that a person having a disease will be correctly identified by a clinical test (i.e., the number of true positive results divided by the number of true positive and false negative results). Therefore, if a cancer detection method has high sensitivity, the percentage of cancers detected is high, e.g., 80%, preferably greater than 90%. The term “specificity” refers to the conditional probability that a person not having a disease will be correctly identified by a clinical test (i.e., the number of true negative results divided by the number of true negative and false positive results). Therefore, if a cancer detection method has high specificity, e.g., 80%, preferably 90%, more preferably 95%, the percentage of false positives the method produces is low. A “cytologic sample” encompasses any sample collected from a patient that contains that patient's cells. Examples of cytological samples envisioned by the present invention include body fluids, epithelial cell-based organ system washings, scrapings, brushings, smears or effusions, and fine-needle aspirates and biopsies.

Use of the markers described in this invention assumes that it is possible to obtain an adequate cytologic sample routinely and that the samples can be adequately preserved for subsequent evaluation. The cytologic sample may be processed and stored in a suitable preservative. Preferably, the cytologic sample is collected in a vial containing the preservative. The preservative is any molecule or combination of molecules known to maintain cellular morphology and inhibit or block degradation of cellular proteins and nucleic acids. To ensure proper fixation, the sample may be mixed at the collection site at high speeds to disaggregate the sample and/or break up obscuring material such as mucus, thereby exposing the cells to the preservative.

Once a specimen is obtained, it is desirable to homogenize it, using an appropriate mixing device. This permits using aliquots for multiple purposes, including the possibility of sending aliquots to more than one testing site, as well as preparing multiple slides and/or multiple depositions on a slide. The initial homogenization of the specimen and of each aliquot before use will ensure that each individual slide will have substantially the same distribution of cells, so that comparisons of results from one slide to another will be meaningful.

Preparation of a specimen for analysis involves applying a sample to a microscope slide using methods including, but not limited to, smears, centrifugation, or deposition of a monolayer of cells. Such methods may be manual, semi-automated, or fully automated. The cell suspension may be aspirated, depositing the cells on a filter, and a monolayer of cells transferred to a prepared slide that may be processed for further evaluation. By repeating this process, additional slides may be prepared as necessary. The present invention encompasses detection of one molecular marker per slide. Detection of several molecular markers per slide is also envisioned. Preferably, 1-6 markers are detected per slide. In some embodiments, 2 markers are detected per slide. In other embodiments, 3 markers are detected per slide.

The present invention contemplates detecting changes in molecular marker expression at the DNA, RNA or protein level using any of a number of methods available to an ordinary skilled artisan. Detection of the changes in the level or pattern of expression of the molecular markers in a cytologic sample generally involves contacting a cytologic sample with a polyclonal or monoclonal antibody or fragment thereof or a nucleic acid sequence that is complementary to the nucleic acid sequence encoding a molecular marker in the panel, collectively “probes”, and a label. Typically, the probe and label components are operatively linked so that when the probe reacts with the molecular marker a signal is emitted (a “labeled probe”). Labels envisioned by the present invention are any labels that emit or enable a signal and allow for identification of a component in a sample. Preferred labels include radioactive, fluorogenic, chromogenic or enzymatic moieties. Therefore, possible methods of detection include, but are not limited to, immunocytochemistry; proteomics, such as immunochemistry; cytogenetics, such as in situ hybridization and fluorescence in situ hybridization; radiodetection; cytometry and field effects, such as MACs and DNA ploidy (the quantitation of stoichiometrically-stained nuclear DNA using automated computerized cytometry); and cytopathology, such as quantitative cytopathology based on morphology. The signal generated by the labeled probe is preferably of sufficient intensity to permit detection by a medical practitioner or technician.

Once the slide is prepared, a medical practitioner conducts a microscopic review of the slides in order to identify cells that exhibit a change in marker expression characteristic of a diagnosis of cancer. The medical practitioner may use an image analysis system and automated microscope to identify cells of interest. Analysis of the data may make use of an information management system and algorithms that will assist the physician in making a definitive diagnosis and selecting the optimal therapeutic approach. A medical practitioner may also examine the sample using an instrument platform that is capable of detecting the presence of the labeled agent.

A molecular diagnostic panel assay will result in one or more glass microscope slides with labeled cells and/or tissue sections. Assessing these (cyto)pathology multilabeled-cell preparations objectively and with clinically meaningful results poses a virtually insurmountable detection and perception problem for a human expert.

Computer-aided imaging systems (i.e., Photonic Microscopes™) can be developed and used to assess quantitatively and reproducibly the amount and location of probe-labeled cells and tissues. Such Photonic Microscopes™ combine robotic slide-handling capabilities, data management systems (e.g., medical informatics), and quantitative digital (optical and electronic) image analysis hardware and software modules to detect and report cell-based probe content and localization data that cannot be obtained by human visualization with comparable sensitivity and accuracy. These probe data can be used to characterize and differentiate cellular samples based upon their related characteristics and differences in their respective cell-based markers for a variety of disease states.

In the present methodology, the molecular diagnostic panels are applied to cell-based specimens and samples, and computer-aided imaging systems are subsequently used to quantify and report the results of the molecular diagnostic panel tests. Such imaging systems can be used to evaluate cell-based samples in which multiple probes are used simultaneously on a given slide-based sample, and in which the probes can be separately analyzed, quantified, and reported because the probes are differentiated by color on the microscope cytology or histology slide.

The signals generated by a labeled agent in the sample may, if they are of appropriate type and of sufficient intensity, be detected by a human reviewer (e.g., pathologist) using a standard microscope or a Computer-Aided Microscope [167]. The Computer-Aided Microscope is an ergonomic, computer-interfaced microscope workstation that integrates mouse-driven control of microscope operation (e.g., stage movement, focusing) with computerized automation of key functions (e.g., slide scanning patterns). A centralized Data Management System stores, organizes and displays relevant patient information as well as results from all specimen screenings and pathologist reviews. An identification number that is imprinted onto barcodes and affixed to each sample slide uniquely identifies each sample in the database, and relates it to the original specimen and the patient.

In a preferred embodiment, the signals generated by a labeled agent in the sample will be detected and quantitated using an automated image analysis system, or Photonic Microscope, interfaced to the centralized Data Management System. The Photonic Microscope provides fully automated software control of the microscope operations and incorporates detectors and other components appropriate for quantitation even of signals not detectable by human reviewers, such as very faint signals or signals from radiolabeled moieties. The location of detected signals is stored electronically for rapid relocation by automated instruments, and for human review using a Computer-Aided Microscope [168].

The centralized Data Management System archives all patient and sample data using the bar-coded identification number. The data may be acquired asynchronously, from a multiplicity of sites, and may be derived from multiple reviews and analyses by human cytologists and/or automated analyzers. These data may include results from multiple sample slides representing aliquots from a single previously homogenized patient specimen. Part or all of the data may be transferred to or from a hospital's Laboratory Information System to meet reporting, archiving, billing or regulatory requirements. A single, comprehensive report with integrated results from panel tests and human reviews may be generated and delivered to the physician in hardcopy, or electronically through networked computers or the Internet.

In some embodiments, the instant method allows for differential discrimination of different diseases, such as different histologic types of cancers. The term “histologic type” refers to specific disease states. Depending on the general disease state there can be one or several histologic types. For example, lung cancer includes, but is not limited to, squamous cell carcinoma, adenocarcinoma, large cell carcinoma, small cell carcinoma and mesothelioma. Knowledge of the histologic type of cancer affecting a patient is very useful because it helps the medical practitioner to localize and characterize the disease and to determine the optimal treatment strategy.

In order to determine the specific disease state, a panel of markers is selected that allows for discrimination between specific disease states. For example, within a panel of molecular markers, a pattern of expression may be identified that is indicative of a particular histologic type of cancer. The detection of the level of expression of the panel of molecular markers is achieved by the above-described methods. Preferably, a panel of 1-20 molecular markers is employed to discriminate among the various histologic types of lung cancer. However, most preferably, 4-7 markers are used. Decision trees may be developed to aid in discriminating between different histologic types based on patterns of marker expression.

In addition to allowing for the detection of malignant cells in a cytologic sample, the instant invention has utility in the molecular characterization of the disease state. Such information is often of prognostic significance and can assist the physician in the selection of the optimal therapeutic approach for a particular patient. In addition, the panel of markers described in this invention may have utility in monitoring the patient either for recurrence or to measure the efficacy of the therapy being used to treat the disease.

By way of non-limiting example, the presence of lung cancer may be detected by a lung cancer detection panel and the specific type of lung cancer may be detected by a discrimination panel. If the medical practitioner determines that malignant cells are present in the cytologic sample, a further analysis of the histologic type of lung cancer may be performed. The histologic type of lung cancer encompassed by the present invention includes but is not limited to squamous cell carcinoma, adenocarcinoma, large cell carcinoma, small cell carcinoma and mesothelioma. FIG. 1 illustrates molecular markers that are preferable markers to be included in a panel for identifying different histologic types of lung cancer. The column labeled “%” indicates the percentage of tumor specimens that express a particular marker.

In determining the various histologic types of lung cancer, the relative level of expression of a marker is analyzed. FIG. 2 illustrates how different markers may be used to discriminate among different histologic types of cancer. In this table, SQ indicates squamous cell carcinoma, AD indicates adenocarcinoma, LC indicates large cell carcinoma, SC indicates small cell carcinoma and ME indicates mesothelioma. The numbers appearing in each cell represent frequency of marker change in one cell type versus another. To be included in the table, the ratio must be greater than 2.0 or less than 0.5. A number larger than 100 generally indicates that the second marker is not expressed. In such cases the denominator was set at 0.1 for the purpose of the analysis. Finally, empty cells represent either no difference in expression or the absence of expression data.
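The inclusion rule for the table may be sketched as follows, assuming that each entry is the frequency of marker change (in percent) in the first histologic type divided by that in the second. The function names and example frequencies are hypothetical; only the 2.0 and 0.5 cut-offs and the 0.1 denominator come from the description above.

```python
def frequency_ratio(first_pct, second_pct, floor=0.1):
    """Ratio of marker-change frequencies; a zero denominator is replaced by 0.1."""
    denominator = second_pct if second_pct > 0 else floor
    return first_pct / denominator

def table_entry(first_pct, second_pct):
    """Return the ratio if it qualifies for the table, otherwise None (empty cell)."""
    ratio = frequency_ratio(first_pct, second_pct)
    if ratio > 2.0 or ratio < 0.5:
        return ratio
    return None

# Hypothetical frequencies (percent of specimens showing the marker change).
strong_difference = table_entry(60, 20)   # ratio 3.0, included
no_difference = table_entry(40, 30)       # ratio about 1.33, left empty
not_expressed = table_entry(50, 0)        # denominator set to 0.1, ratio above 100
```

Under this reading, an entry above 100 arises whenever the second histologic type shows no marker change at all, which matches the interpretation given for large numbers in the table.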

One method for analyzing the data collected is to construct decision trees. Schemes 1-4 are examples of decision trees that may be constructed to enable a differential determination of a histologic type of lung cancer using the patterns of expression. The present invention is in no way limited to the decision trees presented in Schemes 1-4. The relative level of expression of a marker can be higher, lower, or the same (ND) as the level of expression of the molecular marker in a malignant cell of a different histologic type. Each scheme enables a distinction between five histologic types of lung cancer through the use of the indicated panel of molecular markers.

For example, in Scheme 1 the panel consists of HERA, MAGE-3, Thrombomodulin and Cyclin D1. First the sample is contacted with a labeled probe directed toward HERA. If the expression of HERA is lower than the control, the test indicates that the histologic type of lung cancer is mesothelioma (ME). If, however, the expression is higher than or the same as the control, the sample is contacted with a probe directed toward MAGE-3. If the expression of MAGE-3 is lower than the control, the sample is contacted with a labeled probe directed toward Cyclin D1 and a determination of small cell carcinoma (SC) or adenocarcinoma (AD) is possible. If the expression of MAGE-3 is higher than or the same as the control, the sample is contacted with a labeled probe directed toward Thrombomodulin and a determination of squamous cell carcinoma (SQ) or large cell carcinoma (LC) is possible.
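The first two levels of Scheme 1 can be sketched as a simple function. This is an illustration only: the function name and the "lower"/"higher"/"same" encoding are assumptions, and because the text does not state which direction of Cyclin D1 or Thrombomodulin expression indicates which type, the sketch stops at the pair of candidate types and names the probe that resolves them.

```python
def scheme1_candidates(hera, mage3=None):
    """Return (candidate histologic types, next probe to apply or None).

    hera, mage3: expression relative to control, one of 'lower', 'higher', 'same'.
    """
    if hera == "lower":
        return {"ME"}, None                   # mesothelioma
    if mage3 is None:
        return {"SQ", "AD", "LC", "SC"}, "MAGE-3"
    if mage3 == "lower":
        return {"SC", "AD"}, "Cyclin D1"      # small cell carcinoma or adenocarcinoma
    return {"SQ", "LC"}, "Thrombomodulin"     # squamous cell or large cell carcinoma

result_me = scheme1_candidates("lower")
result_pending = scheme1_candidates("same")
result_sc_ad = scheme1_candidates("higher", "lower")
result_sq_lc = scheme1_candidates("same", "same")
```

Each call mirrors one path through the tree, so the same structure could be adapted to Scheme 2 by substituting its markers.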

In Scheme 2 the panel consists of E-Cadherin, Pulmonary Surfactant B and Thrombomodulin. First the sample is contacted with a labeled probe directed toward E-Cadherin. If the expression of E-Cadherin is lower than the control, the test indicates that the histologic type of lung cancer is mesothelioma (ME). If, however, the expression is higher than or the same as the control, the sample is contacted with a probe directed toward Pulmonary Surfactant B. If the expression of Pulmonary Surfactant B is lower than the control, the sample is contacted with a labeled probe directed toward Thrombomodulin and a determination of squamous cell carcinoma (SQ) or large cell carcinoma (LC) is possible. If the expression of Pulmonary Surfactant B is higher than or the same as the control, the sample is contacted with a labeled probe directed toward CD44v6 and a determination of adenocarcinoma (AD) or small cell carcinoma (SC) is possible. (See Schemes 3 and 4 for more examples of decision trees.)

A preferred method involves using panels of molecular markers where differences in the pattern of expression permit discrimination between the various histologic types of lung cancer.

Many different decision trees may be constructed to analyze the patterns of marker expression. This information may be used by physicians or other health care providers to make patient management decisions and select an optimal treatment strategy.

5. Reporting of Results of Panel Analysis

The results from the panel analysis may be reported in several ways. For example, the results may be reported as a simple “yes or no” result. Alternatively, the result may be reported as a probability that the test results are correct. For example, the results from a detection panel study may indicate whether a patient has a generic disease state or not. As the panel also reports the specificity and sensitivity, the results may also be reported as the probability that the patient has a generic disease state. The results from a discrimination panel analysis will discriminate among specific disease states. The results may be reported as a “yes or no” with respect to whether the specific disease state is present. Alternatively, the results may be reported as a probability that a specific disease state is present. It is also possible to perform several discrimination panel analyses on a specimen from one patient and report a profile of the probabilities that the disease state present is a specific disease state with respect to the other possibilities. The other possibilities may also include false positives.
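One way to turn a positive detection-panel result into a probability of disease is Bayes' rule, sketched below. This is an assumption about how such a probability might be computed, not a statement of the method used in the examples; in particular it requires a prevalence figure for the tested population, which the text does not specify, and all numbers shown are hypothetical.

```python
def probability_of_disease_given_positive(sensitivity, specificity, prevalence):
    """Probability that a patient with a positive panel result has the disease."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# Hypothetical panel with 90% sensitivity and 95% specificity.
low_risk = probability_of_disease_given_positive(0.90, 0.95, prevalence=0.01)
high_risk = probability_of_disease_given_positive(0.90, 0.95, prevalence=0.20)
```

With these hypothetical figures the same positive result corresponds to roughly a 15% probability of disease at 1% prevalence and roughly 82% at 20% prevalence, which illustrates why the tested population matters when a result is reported as a probability.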

In embodiments in which a profile of the probabilities of each specific disease state being present is produced, there are several possible outcomes. For example, it is possible that all of the probabilities will be very small. In this instance, it is possible that the doctor will conclude that the patient's specimen diagnosis is a false positive. It is also possible that all of the probabilities will be low except for one that is above 80-90%. In this instance, it is possible that the doctor will conclude that the test verifies that the patient has the specific disease state that indicated the high probability. It is also possible that most of the probabilities will be low, but similarly high probabilities are reported for two specific disease states. In this case, a doctor may recommend more extensive panel testing to ensure that the correct disease state is identified. Another possibility is that all of the probabilities reported will be low, with one being slightly higher than the rest but not high enough to be in the 80-90% range. In this case, a doctor may recommend more extensive panel testing to ensure that the correct disease state is identified and/or to rule out metastatic cancer from a remote primary tumor of a different cancer type.
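The outcomes described above can be sketched as a simple triage of the probability profile. The 0.8 cut-off reflects the lower end of the 80-90% range mentioned in the text; the 0.05 cut-off for "very small", the function name and the example profiles are assumptions made only for illustration, and the output is a prompt for the physician rather than a diagnosis.

```python
def interpret_profile(profile, high=0.8, very_low=0.05):
    """Map a profile of {disease state: probability} to one of the described outcomes."""
    if all(p <= very_low for p in profile.values()):
        return "possible false positive"
    high_states = [state for state, p in profile.items() if p >= high]
    if len(high_states) == 1:
        return "consistent with " + high_states[0]
    # Two or more high probabilities, or none reaching the high range.
    return "more extensive panel testing"

# Hypothetical profiles for the five histologic types of lung cancer.
single_high = interpret_profile({"SQ": 0.05, "AD": 0.92, "LC": 0.03, "SC": 0.02, "ME": 0.01})
two_high = interpret_profile({"SQ": 0.85, "AD": 0.83, "LC": 0.03, "SC": 0.02, "ME": 0.01})
all_small = interpret_profile({"SQ": 0.02, "AD": 0.03, "LC": 0.01, "SC": 0.02, "ME": 0.01})
one_moderate = interpret_profile({"SQ": 0.10, "AD": 0.45, "LC": 0.08, "SC": 0.05, "ME": 0.02})
```

The last two branches both lead to further testing, as in the text, although the reasons differ: ambiguity between two states in one case and an inconclusive profile in the other.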

The following Example is illustrative of the method of the invention for selecting a disease detection panel, disease discrimination panels, validation of the panels and use of the panels in the clinic to screen for a disease and to discriminate among different subtypes of the disease. Lung cancer was selected for this illustrative example, in part because of its importance to world health, but it will be appreciated that similar procedures will apply to other types of cancer, as well as to infectious, degenerative and autoimmune diseases, according to the foregoing general disclosure.

ILLUSTRATIVE EXAMPLE

The present method was used to develop lung cancer detection panels as well as single lung cancer type specific discrimination panels. Lung cancer is an extremely complex collection of diseases that can be segregated into two main classes. Non-small cell lung carcinoma (NSCLC), which accounts for approximately 70 to 80% of all lung cancers, can be further subdivided into three main histologic types: squamous cell carcinoma, adenocarcinoma, and large cell carcinoma. The remaining 20 to 30% of lung cancer patients present with small cell lung carcinoma (SCLC). In addition, malignant mesothelioma of the pleural space can develop in individuals exposed to asbestos and will often spread widely, invading other thoracic structures. Different forms of lung cancer tend to localize in different regions of the lung, have different prognoses, and respond differently to various forms of therapy.

According to the latest statistics from the World Health Organization (Globocan 2000), lung cancer has become the most common fatal malignancy in both men and women, with an estimated 1.24 million new cases and 1.1 million deaths each year. In the U.S. alone, the National Cancer Institute reports that there are approximately 186,000 new cases of lung cancer and that each year 162,000 people die of the disease, accounting for 25% of all cancer-related deaths. In the U.S., overall 1-year survival for patients with lung cancer is 40%; however, only 14% live 5 years. In other parts of the world, 5-year survival is significantly lower (5% in the UK). The high mortality of lung cancer can be attributed to the fact that most patients (85%) are diagnosed with advanced disease, when treatment options are limited and the disease is likely to have metastasized. In these patients, 5-year survival is between 2 and 30%, depending on the stage at the time of diagnosis. This is in sharp contrast to cases where patients are diagnosed early and 5-year survival is greater than 75%. While it is true that a number of new chemotherapeutic agents have been introduced into clinical practice for the treatment of advanced lung cancer, to date, none have yielded a significant improvement in long-term survival. Even though patients with early stage disease can presumably be cured by surgery, they remain at significant risk, as there is a high probability that they will develop a second malignancy. Thus, for the lung cancer patient, early detection and treatment followed by aggressive monitoring provides the best chance of achieving significant improvements in long-term survival along with a reduction in morbidity and cost.

At the present time, a patient is suspected of having lung cancer either because of a suspicious lesion on X-ray or because the patient becomes symptomatic. As a result, most patients are diagnosed with relatively late stage disease. In addition, because most methods lack sufficient sensitivity with respect to the detection of early stage disease, the current policy of the U.S. National Cancer Institute (NCI), National Institutes of Health, recommends against screening for lung cancer even in populations of patients who are at significant risk. In this embodiment of the present invention, however, sputum cytology is employed to provide a relatively noninvasive, more effective and cost-effective means for the early detection of lung cancer.

The specificity of sputum cytology is relatively high. Recent studies have indicated that experienced cytotechnologists are able to recognize malignant or severely dysplastic cells with a high degree of accuracy and reliability [10]. While the detection rate can be as high as 80 to 90% when samples are collected from patients with relatively advanced disease [11,12], overall, sputum cytology has a sensitivity of only 30-40% [13,14]. The low sensitivity of sputum cytology is particularly important given that obtaining and preparing the specimen can be relatively expensive. Furthermore, failing to detect a malignancy can significantly delay treatment, thereby reducing the chance of achieving a cure.
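To make the consequence of a 30-40% sensitivity concrete, the following minimal sketch applies the standard definition, Sensitivity = True Positives/(True Positives + False Negatives), to a hypothetical cohort. The cohort size of 1,000 and the function name `expected_false_negatives` are assumptions chosen for illustration only; they do not come from the cited studies.

```python
def expected_false_negatives(n_diseased: int, sensitivity: float) -> float:
    """Expected number of diseased patients that a test misses.

    Since sensitivity = TP / (TP + FN) and TP + FN = n_diseased,
    the expected number of false negatives is n_diseased * (1 - sensitivity).
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    return n_diseased * (1.0 - sensitivity)


# Hypothetical cohort: 1,000 patients who all have lung cancer.
COHORT_SIZE = 1000

# Sensitivity range quoted above for conventional sputum cytology.
for sensitivity in (0.30, 0.40):
    missed = expected_false_negatives(COHORT_SIZE, sensitivity)
    print(f"sensitivity {sensitivity:.0%}: about {round(missed)} of {COHORT_SIZE} cases missed")
```

Under these assumptions, well over half of the affected patients in such a cohort would receive a false negative result, which is the delay-in-treatment risk the paragraph above describes.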

The selection of an “at-risk” population can also influence the value ofsputum cytology as a screening tool. Individuals who are at significantrisk include those with a prior diagnosis of lung cancer, long-termsmokers or former smokers (>30 pack years) and individuals withlong-term exposure to asbestos or pulmonary carcinogens. People with agenetic predisposition or familial history are also included in an“at-risk” population. Such individuals are likely to benefit fromtesting. While the inclusion of individuals with lower risk may resultin an increase in the absolute number of cases detected, it would behard to justify the substantial increase in healthcare costs.

Other factors that contribute to the relatively poor performance ofconventional sputum cytology include the location of the lesion, tumorsize, histologic type, and the quality of the sample. Squamous-cellcarcinoma accounts for 31% of all primary pulmonary neoplasms. Most ofthese tumors arise from segmental bronchi and extend to the proximallobar and distal subsegmental branches [15]. For this reason, sputumcytology is reasonably effective (79%) in detecting these lesions.Currently, squamous cell carcinoma is viewed as the only type of lungcancer that is amenable to cytologic detection in an in situ andradiologically occult stage [15], as sloughed cells are more likely tobe available for evaluation. In one large study where patients werefollowed with both chest X-ray and sputum cytology, 23% of all lungcancers were detected by cytology alone, suggesting that the tumors wereearly stage and radiologically occult [16]. In another study [17],sputum cytology detected 76% of patients with radiologically occulttumors.

In the case of adenocarcinoma, 70% of tumors occur in the periphery of the lung, making it less likely that malignant cells will be found in a conventional sputum specimen. For this reason, adenocarcinomas are detected less often by sputum cytology (45%) [12,18,19], an important consideration, since the incidence of adenocarcinoma appears to be increasing, particularly in women [20-22].

Tumor size can also affect the likelihood of achieving a correctdiagnosis, a factor that is particularly important when considering ascreening test for the detection of disease in asymptomatic individuals.While there is only a 50% chance that tumors <24 mm will be read as atrue positive, the probability of detecting a larger lesion is in excessof 84% [12].

Recent reports also indicate that the cellularity of the specimen willaffect the sensitivity of sputum cytology [14,23]. In general, patientswith squamous cell carcinoma produce specimens with significant numbersof tumor cells, thereby increasing the likelihood of a correct diagnosis[14,23]. For patients with adenocarcinoma, the presence of tumor cellsin a sputum specimen is reported to be less than 10% in 95% of thespecimens and less than 2% in 75% of specimens, making the diagnosissignificantly more difficult.

The degree of differentiation can also influence the ability of apathologist to detect malignant cells, particularly in cases ofadenocarcinoma. Well-differentiated tumor cells frequently resemblenonneoplastic respiratory epithelial cells. In the case of small-celllung carcinoma, sputum samples often contain nests of loosely aggregatedcells that have a distinct appearance. However, techniques currentlyused to process sputum samples tend to disaggregate the cells, making adiagnosis more difficult.

Sample quality is another factor that can contribute to the lowsensitivity of sputum cytology. Recent reports suggest that it ispossible to obtain adequate samples from 70-85% of subjects. However,achieving this measure of success often requires that patients providemultiple specimens [13]. This procedure is inconvenient, time-consumingand costly. Patient compliance is also generally low, as patients arefrequently asked to collect over several days [13]. Of equal importanceis the observation that former smokers, while at significant risk fordeveloping lung cancer, often fail to produce an adequate specimen.Sample preservation and processing is another critical factor that canaffect the value of sputum cytology as a diagnostic test.

Lastly, even if adequate samples could be obtained and optimallyprepared, cytotechnologists generally still have to review 2-4 slidesper specimen, each typically taking up to four minutes [24]. Given thelow sensitivity, high technical complexity and labor intensity ofconventional sputum cytology, it is not surprising that this test hasbeen almost universally rejected as a population-based screen for theearly detection of lung cancer [25].

Even if these technical issues were resolved, the low sensitivity ofsputum cytology remains a significant problem. The high incidence offalse negative results can significantly delay the patient receivingpotentially curative therapy. While it may be possible to develop testswith greater sensitivity, such improvements must not come at the cost ofspecificity. An increase in the number of false positive results wouldsubject patients to unnecessary, often invasive and costly, follow-upand would have a negative impact on the patient's quality of life. Thepresent invention overcomes many of the limitations associated withprevious methods of early cancer detection, including those related tothe use of sputum cytology for the early detection of lung cancer.

Lung cancer is a heterogeneous collection of diseases. To ensure that atest has the necessary level of sensitivity and specificity to justifyits use as a population based screen, the present invention envisionsusing, for example, a library of 10 to 30 cellular markers to developpanels. Selection of the library of this invention was based on a reviewand reanalysis of the relevant scientific literature where, in mostcases, marker expression was measured in biopsy specimens taken frompatients with lung cancer in an attempt to link expression withprognosis.

For example, a preferred panel for early detection, characterization, and/or monitoring of lung cancer in a patient's sputum may include molecular markers for which a change in expression occurred in at least 75% of tumor specimens. An exemplary panel includes markers selected from VEGF, Thrombomodulin, CD44v6, SP-A, Rb, E-Cadherin, cyclin A, nm23, telomerase, Ki-67, cyclin D1, PCNA, MAGE-1, Mucin, SP-B, HERA, FGF-2, C-MET, thyroid transcription factor, Bcl-2, N-Cadherin, EGFR, Glut-1, ER-related (p29), MAGE-3 and Glut-3. A most preferred panel includes molecular markers for which a change in expression occurs in more than 85% of tumor specimens. An exemplary panel includes molecular markers selected from Glut1, HERA, Muc-1, Telomerase, VEGF, HGF, FGF, E-cadherin, Cyclin A, EGF Receptor, Bcl-2, Cyclin D1 and N-cadherin. With the exception of Rb and E-cadherin, a diagnosis of lung cancer is associated with an increase in marker expression. A brief description of the library of probes/markers utilized in the present example is provided below in Table 4. The numbering of the antibodies in the table is used consistently for the antibodies/probes/markers throughout.

TABLE 4 Probes and Markers for Lung Panel

No. | Marker Abbreviation | Full Name of Antibody Probe | Target Marker Name/Description
--- | --- | --- | ---
1 | VEGF | anti-VEGF | Vascular Endothelial Growth Factor protein
2 | Thrombomodulin | anti-Thrombomodulin | trans-membrane glycoprotein
3 | CD44v6 | anti-CD44v6 | cell surface glycoprotein (CD44 variant 6 gene): cell adhesion molecule
4 | SP-A | anti-Surfactant Apoprotein A | pulmonary surfactant apoprotein
5 | Retinoblastoma | anti-Retinoblastoma gene product | phosphoprotein
6 | E-Cadherin | anti-E-Cadherin | transmembrane Ca⁺⁺ dependent cell adhesion molecule
7 | Cyclin A | anti-Cyclin A | protein subunit of cyclin-dependent kinase enzymes: for cell cycle regulation
8 | nm23 | anti-nm23 | 2 closely related proteins produced by nm23-H1 and -H2 genes
9 | Telomerase | anti-Telomerase | ribonucleoprotein enzyme for chromosome repair
10 | Mib-1 (Ki-67) | anti-Ki-67 | nuclear protein: expressed in proliferating cells
11 | Cyclin D1 | anti-Cyclin D1 | protein subunit of cyclin-dependent kinase enzymes: for cell cycle regulation
12 | PCNA | anti-Proliferating Cell Nuclear Antigen | protein cofactor for DNA polymerase delta
13 | MAGE-1 | anti-Melanoma-Associated Antigen 1 | cell recognition protein coded by MAGE family of genes
14 | Mucin 1 (MUC-1) | anti-Mucin 1 | cell surface and secreted mucin (highly glycosylated protein)
15 | SP-B | anti-mature Surfactant Apoprotein B | pulmonary surfactant apoprotein
16 | HERA | anti-Human Epithelial Related Antigen (MOC-31) | cell surface antigen (transmembrane protein)
17 | FGF-2 (basic FGF) | anti-Fibroblast Growth Factor | protein that binds to cell surface
18 | c-MET | anti-c-MET | trans-membrane receptor protein for Hepatocyte Growth Factor (HGF)
19 | Thyroid Transcription Factor 1 | anti-TTF-1 | regulator of thyroid-specific genes: also expressed in lung
20 | BCL-2 | anti-BCL2 | intracellular membrane-bound protein encoded by BCL2 gene
21 | P120 | anti-p120 | Proliferation-Associated Nucleolar Antigen protein
22 | N-Cadherin | anti-N-Cadherin | transmembrane Ca⁺⁺ dependent cell adhesion molecule
23 | EGFR | anti-EGFR | Epidermal Growth Factor Receptor: transmembrane glycoprotein
24 | Glut 1 | anti-Glut 1 | Glucose-transporting, transmembrane Glut family of proteins
25 | ER-related (p29) | anti-ER-related P29: anti-HSP 27 | Estrogen Receptor-related p29 protein: Heat Shock protein 27
26 | Mage 3 | anti-Melanoma-Associated Antigen 3 | cell recognition protein coded by MAGE family of genes
27 | Glut 3 | anti-Glut 3 | Glucose-transporting, transmembrane Glut family of proteins
28 | PCNA (higher dilution) | anti-Proliferating Cell Nuclear Antigen | protein cofactor for DNA polymerase delta
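The frequency criterion described above (a change in expression in at least 75% of tumor specimens for a preferred panel, and in more than 85% for a most preferred panel) can be sketched as a simple filter. The following is only an illustration under stated assumptions: the function name `select_panel` and the dictionary `REPORTED_CHANGE_PCT` are hypothetical, and the percentages are single figures quoted in the marker descriptions of this section, not the Table 5 data on which the actual panels were based.

```python
# Percent of tumor specimens showing a change in marker expression.
# Illustrative values only, taken from figures quoted in this section;
# they are NOT the Table 5 data used to build the actual panels.
REPORTED_CHANGE_PCT = {
    "PCNA": 98.0,              # detected in 98% of tumors
    "ER-related (p29)": 98.0,  # 109 of 111 lung cancers (rounded)
    "HGF": 88.5,               # increased protein by Western blot
    "Cyclin A": 75.0,          # overexpressed in 75% of NSCLC
    "Rb": 17.6,                # tumors that fail to express pRb
}


def select_panel(change_pct, threshold, strict=False):
    """Return the markers that meet the threshold, most frequent first.

    strict=False keeps markers with pct >= threshold ("at least");
    strict=True keeps markers with pct > threshold ("more than").
    """
    kept = [
        name
        for name, pct in change_pct.items()
        if (pct > threshold if strict else pct >= threshold)
    ]
    # Sort by descending frequency, then by name for a stable order.
    return sorted(kept, key=lambda name: (-change_pct[name], name))


preferred = select_panel(REPORTED_CHANGE_PCT, 75.0)
most_preferred = select_panel(REPORTED_CHANGE_PCT, 85.0, strict=True)
print("preferred (at least 75%):", preferred)
print("most preferred (more than 85%):", most_preferred)
```

A frequency cut-off of this kind only narrows the candidate library; it says nothing about how well a marker separates tumor types, which is why the descriptions that follow also note each marker's potential for discrimination.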

Each molecular marker in the preferred panel is described below. Table 5, reciting the percentage of expression of the markers in tissue for each type of lung cancer, is provided at the end of this section.

Glucose Transporter Proteins (Glut 1 and Glut 3) [26-28]

Glucose Transporter-1 (Glut 1) and Glucose Transporter-3 (Glut 3) are ubiquitously expressed high affinity glucose transporters. Tumor cells often display higher rates of respiration, glucose uptake, and glucose metabolism than do normal cells, and the elevated uptake of glucose in tumor cells is thought to be mediated by glucose transporters. Overexpression of certain types of Glut isoforms has been reported in lung cancer. The cellular localization of Glut 1 is in the cell membrane. Glut 1 and Glut 3 are disease markers useful for detection of a disease state.

Malignant cells exhibit an increase in glucose uptake that appears to bemediated by a family of glucose transporter proteins (Gluts). Oncogenesand growth factors appear to regulate the expression of these proteinsas well as their activities. Members of the Glut family of proteinsexhibit different patterns of distribution in various human tissues andrapid proliferation is often associated with their overexpression.Recent evidence suggests that Glut1 is expressed by a large percentageof NSCLC and by a majority of SCLC.

While the expression of Glut 3 is relatively low in both NSCLC and SCLC, a significant percentage (39.5%) of large cell carcinomas express the protein. In stage I tumors, 83% express Glut1 at some level, with 75-100% of cells staining in 25% of cases. These data suggest that Glut1 overexpression is a relatively early event in tumor progression. Glut1 immunoreactivity has also been detected in >90% of stage II and IIIA cancers. There also appears to be an inverse correlation between Glut1 and Glut3 immunoreactivity and tumor differentiation. Tumors expressing high levels of Glut1 appear to be particularly aggressive and are associated with a poor prognosis. In cases where tumors were negative for the proteins, better survival was observed.

Human Epithelial Related Antigen (HERA) [29,30]

HERA is a transmembrane glycoprotein with an as yet unknown function. HERA is present on most normal and malignant epithelia. Recent reports suggest that HERA expression is high in all histologic types of NSCLC, making it useful as a detection marker. In contrast, HERA expression is absent in mesothelioma, suggesting that it would also have utility as a discrimination marker. The cellular localization of HERA is the cell surface.

Basic Fibroblast Growth Factor (FGF) [31-34]

Basic Fibroblast Growth Factor (FGF) is a polypeptide growth factor with a high affinity for heparin and other glycosaminoglycans. In cancer, FGF functions as a potent mitogen, plays a role in angiogenesis, differentiation, and proliferation, and is involved in tumor progression and metastasis. FGF overexpression frequently occurs in both SCLC and squamous cell carcinoma. In many cases (62%), the cells also express the FGF receptor, suggesting the presence of an autocrine loop. Forty-eight percent of stage I tumors overexpress FGF. The frequency of FGF in stage II lung cancer is 84%. Expression of either the growth factor or its receptor was associated with a poor prognosis. Five-year survival rates for patients with stage I disease were 73% for those expressing FGF versus 80% for those who were FGF negative. The cellular localization is the cell membrane.

Telomerase [35-42]

Telomerase is a ribonucleoprotein enzyme that extends and maintains telomeres of eukaryotic chromosomes. It consists of a catalytic protein subunit with reverse transcriptase activity and an RNA subunit that serves as the template for telomere extension. Cells that do not express telomerase have successively shortened telomeres with each cell division, which ultimately leads to chromosomal instability, aging and cell death. The cellular localization of telomerase is nuclear.

Expression of telomerase appears to occur in immortalized cells, and enzyme activity is a common feature of the malignant phenotype. Approximately 80-94% of lung tumors exhibit high levels of telomerase activity. In addition, 71% of hyperplasia, 80% of metaplasia, and 82% of dysplasia express enzyme activity. All the carcinoma in situ (CIS) specimens exhibit enzyme activity. The low levels of expression in premalignant tissues are probably related to the fact that only a small percentage of cells (between 5 and 20%) in the sample express enzyme activity. This is in contrast to tumors, where 20-60% of cells may express enzyme activity. Based on a limited number of samples, it would appear that expression of telomerase activity is also common in SCLC.

Proliferating Cell Nuclear Antigen (PCNA) [43-51]

PCNA functions as a cofactor for DNA polymerase delta. PCNA is expressedin both S phase of the cell cycle and during periods of DNA synthesisassociated with DNA repair. PCNA is expressed in proliferating cells ina wide range of normal and malignant tissues. The cellular localizationof PCNA is nuclear.

Expression of PCNA is a common feature of rapidly dividing cells and isdetected in 98% of tumors. Immunohistochemical staining is nuclear withmoderate to intense staining detected in 83% of NSCLC. Intense PCNAstaining was observed in 51% of p53-negative tumors. However, when bothPCNA (>50% of cells staining) and p53 are overexpressed (>10% of cellsstained) the prognosis tends to be poorer with a shorter time toprogression. Although frequently detected in all stages of lung cancer,intense staining for PCNA is more common in metastatic disease.Thirty-one percent of CIS also overexpress PCNA.

CD44 [51-58]

CD44v6 is a cell surface glycoprotein that acts as a cellular adhesion molecule. It is expressed on a wide range of normal and malignant cells in epithelial, mesothelial and hematopoietic tissues. The expression of specific CD44 splice variants has been shown to be associated with metastasis and poor prognosis in certain human malignancies. It is expected to be used for detection and discrimination between squamous cell carcinoma and adenocarcinoma. CD44 is a cell adhesion molecule that appears to play a role in tumor invasion and metastasis. Alternative splicing results in the expression of several variant isoforms. CD44 expression is generally lacking in SCLC and is variably expressed in NSCLC. Highest levels of expression occur in squamous cell carcinoma, thus making it valuable in discriminating between tumor types. In non-neoplastic tissue, CD44 staining is observed in bronchial epithelial cells, macrophages, lymphocytes, and alveolar pneumocytes. There was no significant correlation between CD44 expression and tumor stage, recurrence, or survival, particularly when overexpression occurs in early stage disease. In metastatic lesions, 100% of squamous cell carcinoma and 75% of adenocarcinoma showed strong CD44v6 positivity. These data would tend to indicate that changes in CD44 expression occur relatively late in tumor progression, which could limit its value as an early detection marker. Recent findings suggest that the CD44v8-10 variant is expressed by a majority of NSCLC, making it a possible candidate marker.

Cyclin A [59-62]

Cyclin A is a regulatory subunit of the cyclin-dependent kinases (CDK's)which control the transition points at specific phases of the cellcycle. It is detectable in S-phase and during progression into G2 phase.The cellular localization of Cyclin A is nuclear.

Protein complexes consisting of cyclins and cyclin-dependent kinases function to regulate cell cycle progression. Changes in cyclin expression are associated with genetic alterations affecting the CCND1 gene. While the cyclins act as regulatory molecules, the cyclin-dependent kinases function as catalytic subunits, activating and inactivating Rb.

Immunohistochemical analysis has revealed that the overexpression of thecyclins is associated with an increase in cellular proliferation asindicated by a high Ki-67 labeling index. Cyclin overexpression occursin 75% of NSCLC and appears to occur relatively early in tumorprogression. Recent reports indicate that 66.7% of stage I/II and 70.9%of stage III tumors overexpress Cyclin A. Nuclear staining is common inpoorly differentiated tumors. Expression of cyclin A is often associatedwith a decrease in mean survival time and a tendency towards thedevelopment of drug resistance. However, increased expression has alsobeen associated with a greater response to doxorubicin.

Cyclin D1 [63-73]

Cyclin D1, as with Cyclin A, is a regulatory subunit of the cyclin-dependent kinases (CDK's) which control the transition points at specific phases of the cell cycle. Cyclin D1 regulates the entry of cells into S phase of the cell cycle. The gene encoding Cyclin D1 is frequently amplified and/or its expression deregulated in a wide range of human malignancies. The cellular localization of Cyclin D1 is nuclear.

Like Cyclin A, cyclin D1 functions to regulate cell cycle progression. Staining of cyclin D1 is predominantly cytoplasmic and independent of histologic type. Reports suggest that cyclin D1 overexpression occurs in 40-70% of NSCLC and 80% of SCLC. Cyclin D1 staining was observed in 37.9% of stage I, 60% of stage II, and 57.9% of stage III tumors. Cyclin D1 expression has also been seen in dysplastic and hyperplastic tissue, providing evidence that these changes occur relatively early in tumor progression. Patients who overexpress cyclin D1 exhibit a shorter mean survival time and a lower five-year survival rate.

Hepatocyte Growth Factor Receptor (C-MET) [74-77]

C-MET is a proto-oncogene that encodes a transmembrane receptor tyrosine kinase for HGF. HGF is a mitogen for hepatocytes and endothelial cells, and exerts pleiotropic activity on several cell types of epithelial origin. The cellular localization of C-MET is the cell surface.

Hepatocyte growth factor/scatter factor (HGF/SF) stimulates a broadspectrum of epithelial cells causing them to proliferate, migrate, andcarry out complex differentiation programs including angiogenesis.HGF/SF binds to a receptor encoded by the c-MET oncogene. While bothnormal and malignant tissues express the HGF receptor, expression ofHGF/SF appears to be limited to malignant tissue.

While the human lung generally expresses low levels of HGF/SF,expression increases markedly in NSCLC. Using Western blot analysis,88.5% of lung cancers exhibited an increase in the protein expression.All histologic types of tumors expressed the protein at increasedconcentrations. While increased levels of protein occur in all stages ofthe disease, recent evidence suggests that in addition to the cancercells, stromal cells and/or inflammatory cells may be responsible forthe production of the growth factor.

Mucin—MUC-1 [78-82]

Mucin-1 comes from a family of highly glycosylated secretory proteinswhich comprise the major protein constituents of the mucous gel whichcoats and protects the tracheobronchial tree, gastrointestinal tract andgenitourinary tract. Mucin-1 is atypically expressed in epithelialtumors. The cellular localization of Mucin-1 is cytoplasm and the cellsurface.

Mucins are a family of high molecular weight glycoproteins that are synthesized by a variety of secretory epithelial cells and are either membrane bound or secreted. Within the respiratory tract, these proteins contribute to the mucus gel that coats and protects the tracheobronchial tree. Changes in mucin expression commonly occur in conjunction with malignant transformation, including lung cancer. Evidence exists suggesting that these changes may contribute to alterations in cell growth regulation, recognition by the immune system, and the metastatic potential of the tumor.

Although normal lung tissue expresses MUC-1, significantly higher levelsof expression are found in lung cancer with highest levels occurring inadenocarcinoma. Staining appears to occur independently of stage and ismore common in smokers than in former smokers or nonsmokers. Somepremalignant lesions also exhibit increased MUC-1 expression.

Thyroid Transcription Factor-1 (TTF-1) [83,84]

TTF-1 belongs to a family of homeodomain transcription factors thatactivate thyroid-specific and pulmonary-specific differentiation genes.The cellular localization of TTF-1 is nuclear.

TTF-1 is a protein originally found to mediate the transcription of thyroglobulin. Recently, TTF-1 expression was also found in the diencephalon and bronchioloalveolar epithelium. Within the lung, TTF-1 functions as a transcription factor regulating the synthesis of surfactant proteins and Clara secretory protein. Overexpression of TTF-1 occurs in a large proportion of lung adenocarcinomas and can aid in distinguishing between primary lung cancer and cancers that metastasize to the lung. Adenocarcinomas that express TTF-1 and are cytokeratin 7 positive and cytokeratin 20 negative can be detected with 95% sensitivity.

Vascular Endothelial Growth Factor (VEGF) [33,61,85-89]

VEGF plays an important role in angiogenesis, which promotes tumorprogression and metastasis. There are multiple forms of VEGF; the twosmaller isoforms are secreted proteins and act as diffusible agents,whereas the larger two remain cell associated. The cellular localizationof VEGF is cytoplasmic, cell surface, and extracellular matrix.

Vascular Endothelial Growth Factor (VEGF) is an important angiogenesisfactor and endothelial cell-specific mitogen. Angiogenesis is animportant process in the latter stages of carcinogenesis, tumorprogression and is particularly important in the development of distantmetastasis. VEGF binds to a specific receptor Flt that is often presentin the tumors expressing the growth factor suggesting the presence of anautocrine loop.

Immunohistochemical analysis reveals that cells expressing VEGF exhibita pattern of staining that is diffuse and cytoplasmic. While notexpressed by nonneoplastic cells, VEGF is present in the majority ofNSCLC and in a smaller percentage of SCLC. Several reports have shownhigh levels of VEGF in early stage lung cancer.

Expression of VEGF has been associated with an increased frequency ofmetastasis. Studies have shown that VEGF expression is indicative of apoor prognosis and shorter disease-free interval in adenocarcinoma butnot in squamous cell carcinoma. Three year and five year survival ratesin the group expressing high levels of VEGF were 50% and 16.7% ascompared to 90.9 and 77.9% respectively for the low VEGF group.

Epidermal Growth Factor Receptor (EGFR) [90-104]

Epidermal Growth Factor Receptor (EGFR) is a transmembrane glycoprotein,which can bind and become activated by various ligands. Bindinginitiates a chain of events that result in DNA synthesis, cellproliferation, and cell differentiation. EGFR has been demonstrated in abroad spectrum of normal tissues, and EGFR overexpression is found in avariety of neoplasms. Increased expression has been observed inadenocarcinomas of the lung and large cell carcinomas but not in smallcell lung carcinomas. The cellular localization of EGFR is the cellsurface.

The EGFR plays an important role in cell growth and differentiation. The EGFR is uniformly present in the basal cell layer but not in the more superficial layers of histologically normal bronchial epithelium. With this exception, there is no consistent staining of normal tissue. Recent evidence suggests that the overexpression of the EGF receptor may not be an absolute requirement for the development of invasive lung cancer. However, it appears that in cases where EGFR overexpression occurs, it is a relatively early event, with greater staining intensity in more advanced disease.

For patients with invasive carcinomas, 50-77% of tumors stain for EGF. Overexpression of the EGFR is more common in squamous cell carcinoma than in adenocarcinoma and common in SCLC. Highest levels of EGFR occur in conjunction with late stage and metastatic disease, which have approximately twice the concentration of EGFR as that seen in stage I/II tumors. Estimates suggest that the level of the EGFR observed in stage I tumors is approximately twice that seen in normal tissue. In addition, 48% of bronchial lesions also show EGFR staining, including metaplasia, atypia, dysplasia, and CIS. In the “normal” bronchial mucosa of these same cancer patients, overexpression of the EGFR was observed in 39% of cases but was absent in the bronchial epithelium of the non-cancer patients. In addition, overexpression of the EGFR occurs more frequently in the tumors of smokers than in nonsmokers, particularly in the case of squamous cell carcinoma.

While several studies have suggested that overexpression of the EGFR is associated with a poor prognosis, other studies have failed to make this correlation.

Nucleoside Diphosphate Kinase/nm23 [105-111]

Nucleoside diphosphate kinase (NDP kinase)/nm23 is a nucleosidediphosphate kinase. Tumor cells with high metastatic potential oftenlack or express only a low amount of nm23 protein, hence the nm23protein has been described as a metastasis suppressor protein. Thecellular localization of nm23 is nuclear and cytoplasmic.

Expression of nm23/nucleoside diphosphate/kinase A (nm23) is a marker oftumor progression where there is an inverse relationship betweenexpression and metastatic potential. In cases where stage I tumorsoverexpress nm23, no evidence of metastasis was seen during an averagefollow-up period of 35 months. Immunohistochemical analysis revealsstaining that is diffuse, cytoplasmic and generally limited to malignantcells. Alveolar macrophages also express the protein. Given that highlevels of expression are associated with a low metastatic potential,there is currently no explanation as to why normal epithelial cells donot express nm23.

Intense staining has been observed in a high percentage of NSCLC, particularly large cell lung cancer, and in 74% of SCLC, suggesting that this protein plays an important role in tumor progression. With the exception of squamous cell carcinoma, staining intensity tends to increase with stage. Based on the available evidence, it would appear that nm23 is a prognostic factor in both SCLC and NSCLC.

Bcl-2 [101,112-125]

Bcl-2 is a mitochondrial membrane protein that plays a central role inthe inhibition of apoptosis. Overexpression of bcl-2 is a common featureof cells in which programmed cell death has been arrested. The cellularlocalization of Bcl-2 is the cell surface.

Bcl-2 is a protooncogene believed to play a role in promoting the terminal differentiation of cells, prolonging the survival of non-cycling cells and blocking apoptosis in cycling cells. Bcl-2 can exist as a homodimer or can form a heterodimer with Bax. As a homodimer, Bax functions to induce apoptosis. However, the formation of a Bax-bcl-2 complex blocks apoptosis. By blocking apoptosis, bcl-2 expression appears to confer a survival advantage upon affected cells. Bcl-2 expression may also play a role in the development of drug resistance. The expression of bcl-2 is negatively regulated by p53.

Immunohistochemistry analysis of bcl-2 reveals a heterogeneous patternof cytoplasmic staining. In adenocarcinoma, expression of bcl-2 wassignificantly associated with smaller tumors (<2 cm) and lowerproliferative activity. The expression of bcl-2 appears to be moreclosely associated with neuroendocrine differentiation and occurs in alarge percentage of SCLC.

Overexpression of bcl-2 is not present in preneoplastic lesionssuggesting that changes in bcl-2 occur relatively late in tumorprogression. In addition to tumor cells, bcl-2 immunostaining alsooccurs in basal cells and on the luminal surfaces of normal bronchiolesbut is generally not detected in more differentiated cell types.

Association of bcl-2 immunoreactivity with improved prognosis in NSCLC is controversial. Several reports have suggested that patients with tumors expressing bcl-2 have a superior prognosis and a longer time to recurrence. Several reports indicate that bcl-2 expression tends to be lower in those patients who develop metastatic disease. For patients with squamous cell carcinoma, expression of bcl-2 has been linked to an improvement in 5-year survival. However, in three relatively large studies there was no survival benefit linked to bcl-2 expression, particularly for patients with early stage disease.

Estrogen Receptor-Related Protein (p29) [126]

ER related protein p29 is an estrogen-related heat shock protein thathas been found to correlate with the expression of estrogen-receptor.The cellular localization of p29 is cytoplasmic.

Estrogen-dependent intracellular processes are important in the growthregulation of normal tissue and may play a role in the regulation ofmalignancies. In one study expression of p29 was detected in 109 (98%)of 111 lung cancers. The relation between p29 expression and survivaltime was different for men and women. Expression of p29 was associatedwith poorer survival particularly in women with Stage I and II disease.There was no correlation between p29 expression and long-term survivalin men.

Retinoblastoma Gene Product (Rb) [68,73,123,127-141]

Retinoblastoma Gene Product (Rb) is a nuclear DNA-binding phosphoprotein. Underphosphorylated Rb binds oncoproteins of DNA tumor viruses and gene regulatory proteins, thus inhibiting DNA replication. Rb protein may act by regulating transcription; loss of Rb function leads to uncontrolled cell growth. The cellular localization of Rb is nuclear.

Retinoblastoma protein (pRb) is a protein that is encoded by the retinoblastoma gene and is phosphorylated and dephosphorylated in a cell cycle dependent manner. pRb is considered an important tumor suppressor that functions to regulate the cell cycle at G0/G1. In its hypophosphorylated state, pRb inhibits the transition from G1 to S. During G1, inactivation of the growth suppressive properties of pRb occurs when the cyclin dependent kinases (CDKs) phosphorylate the protein. The hyperphosphorylation of pRb prevents it from forming a complex with E2F, which functions as a transcription factor for proteins that are required for DNA synthesis.

Inactivation of the retinoblastoma (Rb) gene has been documented in various types of cancer, including lung cancer. Small-cell carcinomas fail to stain for pRb, indicating loss of Rb function. Overall, 17.6% of the tumors fail to express pRb, with no correlation being seen with respect to stage or nodal status. A reduction in staining has also been seen in 31% of dysplastic bronchial biopsies. However, there appears to be no correlation between pRb expression and the severity of dysplasia. In contrast, normal bronchial epithelium and cells taken from areas adjacent to tumors exhibited pRb-positive nuclei. These data suggest that alterations in the expression of the Rb protein may arise early in the development of some lung cancers.

Patients with Rb-positive carcinomas tend to have a somewhat better prognosis but, in most studies, the difference is not significant. However, patients with adenocarcinoma whose tumors are both pRb negative and either p53 or ras positive exhibit a decrease in 5-year survival. A similar relationship does not occur in squamous cell carcinoma. pRb-negative tumors have been reported to be more likely to exhibit resistance to doxorubicin than Rb-positive carcinomas.

Thrombomodulin [142-147]

Thrombomodulin (TM) is a transmembrane glycoprotein. Through its accelerated activation of protein C (which in turn acts as an anticoagulant by binding protein S and thrombin), synthesis of TM is one of several mechanisms important in reducing clot formation on the surface of endothelial cells. The cellular localization of thrombomodulin is the cell surface.

Aggregation of host platelets by circulating tumor cells appears to play an important role in the metastatic process. Thrombomodulin plays an important role in the activation of the anticoagulant protein C by thrombin and is an important modulator of intravascular coagulation. In addition to its expression in normal squamous epithelium, expression of thrombomodulin also occurs in squamous metaplasia, carcinoma in situ, and invasive squamous cell carcinomas. Although present in 74% of primary squamous cell carcinomas, only 44% of metastatic lesions stained for thrombomodulin. These data suggest that, with progression, there is a decrease in thrombomodulin expression. Higher levels of expression tend to occur in well and moderately differentiated tumors when compared to poorly differentiated tumors.

Patients with thrombomodulin-negative squamous cell carcinoma tend to have a worse prognosis. Eighteen percent of patients with thrombomodulin-negative tumors survive five years, as compared to 60% in cases where the tumors stained positive for the protein. Progression to metastatic disease was also more common in thrombomodulin-negative tumors (69% vs. 37%) and there was a greater tendency for these tumors to develop at extrathoracic sites. Thus, loss of thrombomodulin expression appears to be prognostic in cases of squamous cell carcinoma. The observation that changes in thrombomodulin expression occur in later stages of NSCLC and that the protein is expressed by normal bronchial epithelial cells would tend to limit its utility as a marker for early detection. However, since a majority of mesotheliomas and only a small percentage of adenocarcinomas express thrombomodulin, the marker has potential utility in discriminating between these two tumor types.

E-Cadherin & N-Cadherin [148-151]

E-cadherin is a transmembrane Ca2+ dependent cell adhesion molecule. It plays an important role in the growth and development of cells via the mechanisms of control of tissue architecture and the maintenance of tissue integrity. E-cadherin contributes to intercellular adhesion of epithelial cells, the establishment of epithelial polarization, glandular differentiation, and stratification. Down-regulation of E-cadherin expression has been observed in a number of carcinomas and is usually associated with advanced stage and progression. The cellular localization of E-cadherin is the cell surface.

E-cadherin is a calcium-dependent epithelial cell adhesion molecule. A decrease in E-cadherin expression has been associated with tumor dedifferentiation, metastasis, and decreased survival. Reduced expression has been observed in moderately and poorly differentiated squamous cell carcinoma and in SCLC. There was no change in E-cadherin expression in adenocarcinoma. Furthermore, while adenocarcinomas express E-cadherin, these tumors fail to express N-cadherin, in contrast to mesotheliomas, which express N-cadherin but not E-cadherin. Thus, these markers can be used to discriminate between adenocarcinoma and mesothelioma.

Expression of E-cadherin can also be used to assess the prognosis of patients with squamous cell carcinoma. Whereas 60% of patients with tumors expressing E-cadherin survived three years, only 36% of patients exhibiting a reduction in expression survived three years.

MAGE-1 and MAGE-3 [152-156]

Melanoma Antigen-1 (MAGE-1) and Melanoma Antigen-3 (MAGE-3) are members of a family of genes that are silent in normal tissues but, when expressed in malignant neoplasms, are recognized by autologous, tumor-directed and specific cytotoxic T cells (CTLs). The cellular localization of MAGE-1 and MAGE-3 is cytoplasmic.

MAGE-1, MAGE-3 and MAGE-4 gene products are tumor-associated antigens that are recognized by cytotoxic T lymphocytes. As such, they could have utility as targets for immunotherapy in NSCLC. MAGE proteins are also expressed by some SCLCs but not by normal cells. While the frequency of MAGE expression falls below the level necessary for use as a detection marker, differences in the pattern of expression between histologic types suggest that MAGE proteins may have utility as differentiation markers. This utility is also supported by the observation that, in 50% of squamous cell carcinomas, greater than 90% of tumor cells showed evidence of MAGE-3 overexpression, with 30% of tumors exhibiting overexpression in at least 50% of cells.

Nucleolar Protein (p120) [157]

p120 (proliferation-associated nucleolar antigen) is found in the nucleoli of rapidly proliferating cells during early G1 phase. The cellular localization of p120 is nuclear.

Nucleolar protein p120 is a proliferation-associated protein whose function has yet to be elucidated. Strong staining has been detected in tumor tissue but not in macrophages or normal tissue. Overexpression of p120 was more common in squamous cell carcinoma than in adenocarcinoma or large cell carcinoma, raising the possibility that this marker may have utility in discriminating between tumor types.

Pulmonary Surfactants [83,158-166]

Pulmonary surfactants are a phospholipid-rich mixture that functions to reduce the surface tension at the alveolar-liquid interface, thus providing the alveolar stability necessary for ventilation. Surfactant proteins appear to be expressed exclusively in the airway and are produced by alveolar type II cells. In the non-neoplastic lung, pro-surfactant-B immunoreactivity is detected in normal and hyperplastic alveolar type II cells and some non-ciliated bronchiolar epithelial cells. Sixty percent of adenocarcinomas contained strong cytoplasmic immunoreactivity, with 10-50% of tumor cells exhibiting staining in the majority of cases. Squamous cell carcinoma and large cell carcinoma failed to stain for pro-surfactant-B.

Surfactant Apoprotein B (SP-B) is one of four hydrophobic proteins that make up the pulmonary surfactant, which is a phospholipid and protein complex secreted by type II alveolar cells. Squamous cell and large cell carcinomas of the lung and nonpulmonary adenocarcinomas do not express SP-B. The cellular localization of SP-B is cytoplasmic.

SP-A is a pulmonary surfactant protein that plays an essential role in keeping alveoli from collapsing at the end of expiration. SP-A is a unique differentiation marker of pulmonary alveolar epithelial cells (type II pneumocytes); the antigen is preserved even in the neoplastic state. The cellular localization of SP-A is cytoplasmic.

Pulmonary surfactant A appears to be specific for non-mucinous bronchiolo-alveolar carcinoma, with 100% staining as compared to none of the mucinous type. Pulmonary surfactants potentially have utility in discriminating lung cancer from other cancers metastasized to lung. In addition to tumor cells, non-neoplastic pneumocytes also stain for pulmonary surfactant A. As with pulmonary surfactant B, staining for pulmonary surfactant A is relatively common in adenocarcinoma but not in other forms of NSCLC or in SCLC. Mesothelioma also fails to express pulmonary surfactant A, leading to the suggestion that pulmonary surfactant A may have utility in the discrimination between adenocarcinoma and mesothelioma.

Ki-67

Ki-67 is a nuclear protein that is expressed in proliferating normal and neoplastic cells and is down-regulated in quiescent cells. It is present in G1, S, G2, and M phases of the cell cycle, but is absent in G0 phase. It is commonly used as a marker of proliferation. The cellular localization of Ki-67 is nuclear.

TABLE 5

Marker | Squamous Cell Carcinoma | Adenocarcinoma | Large Cell Carcinoma | Small Cell Carcinoma | Mesothelioma
Glut1 | 100.0⁺ | 64.5 | 80.5 | 64.0 | NDA*
Glut3 | 17.5 | 16.0 | 39.5 | 9.0 | NDA*
HERA | 100.0 | 100.0 | 100.0 | NDA | 4.5
Basic FGF | 83.0 | 48.7 | 50.0 | 100.0 | NDA
Telomerase | 82.3 | 86.3 | 93.0 | 66.7 | NDA
PCNA | 80.0 | 69.8 | 87.7 | 51.0 | NDA
CD44v6 | 79.3 | 34.8 | 44.2 | 0.0 | NDA
Cyclin A | 79.0 | 68.0 | 83.5 | 97.0 | NDA
Cyclin D1 | 42.7 | 36.0 | 62.0 | 90.0 | NDA
Hepatocyte Growth Factor/Scatter Factor | 75.5 | 78.3 | 100.0 | NDA | 100.0
MUC-1 | 55.5 | 90.0 | 100.0 | 100 | NDA
TTF-1 | 38.0 | 76.0 | NDA | 83.0 | NDA
VEGF | 61.8 | 68.3 | 100.0 | 43.5 | NDA
EGF Receptor | 63.1 | 45.3 | 96.0 | Frequently | NDA
nm23 | 68.0 | 52.6 | 83.5 | 73.5 | NDA
Bcl-2 | 45.5 | 43.3 | 42.5 | 92.0 | NDA
Loss of pRb Expression | 20.1 | 25.8 | 35.4 | 85.3 | NDA
Thrombomodulin | 66.8 | 12.2 | 4.0 | 0.0 | 81.0
E-cadherin | 69.0 | 85.0 | NDA | 100.0 | 0.0
N-cadherin | NDA | 4.0 | NDA | NDA | 94.0
MAGE 1 | 45.0 | 35.0 | NDA | 16.5 | NDA
MAGE 3 | 72.0 | 33.3 | NDA | 33.5 | NDA
MAGE 4 | 45.5 | 11.0 | NDA | 50.0 | NDA
Nucleolar Protein (p120) | 68.0 | 35.0 | 30.0 | NDA | NDA
Pulmonary Surfactant B | 0.0 | 61.5 | 0.0 | NDA | NDA
Pulmonary Surfactant A | 12.0 | 52.9 | 17.5 | 20 | 0.0

⁺ Percent of tumors exhibiting a change in marker expression
* No Data Available (NDA)

a. Obtaining a Library of Markers of a Suitable Size

Preliminary pruning steps were required in order to obtain a suitably sized library of markers that were correlated with lung cancer. More than a hundred markers correlated to lung cancer are known in the literature. A partial listing of candidate probes identified in the literature and evaluated for potential inclusion in panel tests includes antibodies to: bax, Bcl-2, c-MET (HGFr), CD44S, CD44v4, CD44v5, CD44v6, cdk2 kinase, CEA (carcino-embryonic antigen), Cyclin A, Cyclin D1 (bcl-1), E-cadherin, EGFR, ER-related (p29), erbB-1, erbB-2, FGF-2 (bFGF), FOS, Glut-1, Glut-2, Glut-3, Glut-4, Glut-5, HERA (MOC-31), HPV-16, HPV-18, HPV-31, HPV-33, HPV-51, integrin VLA2, integrin VLA3, integrin VLA6, JUN, keratin, keratin 7, keratin 8, keratin 10, keratin 13, keratin 14, keratin 16, keratin 17, keratin 18, keratin 19, A-type lamins (A; C), B-type lamins (B1; B2), MAGE-1, MAGE-3, MAGE-4, melanoma-associated antigen clone NKI/C3, mdm2, mib-1 (Ki-67), mucin 1 (MUC-1), mucin 2 (MUC-2), mucin 3 (MUC-3), mucin 4 (MUC-4), MYC, N-cadherin, NCAM (neural cell adhesion molecule), nm23, p120, p16, p21, p27, p53, P-cadherin, PCNA, Retinoblastoma, SP-A, SP-B, Telomerase, Thrombomodulin, Thyroid Transcription Factor 1, VEGF, vimentin, and waf1. The initial list of markers was pruned by first assessing, from the literature, the apparent effectiveness of the probes in detecting early stage cancer cells, discriminating between cells of differing cancer states, and localizing the label to the target cancer cells. This list of markers was further pruned by removing markers whose utilization would be difficult to reduce to practice because they are difficult to produce or obtain, have unsuitable detection technology requirements, or show poor reproducibility of reported results. After all of the pruning steps were complete, a library of 27 markers was obtained.

b. Optimizing Protocols and Obtaining Gold Standard Lung Cancer Samples

Preliminary preparation steps were also required prior to obtaining the panels. The probes containing appropriate labels were available from commercial vendors. The protocols of the probes were analyzed for optimum objective quantitative detection. For example, it was determined that the concentration of PCNA was too low. Originally, PCNA was diluted 1:4000 in S809 buffer. A second dilution was made, which was 1:3200 in S809. The optimized protocol for each marker is shown below. It is noted that the second column is labeled “Antibody Name”. Except for MOC-31, the probes in this list are listed by the marker name because many of the vendors refer to the antibody by the name of the marker. It is noted that an alternative way these reagents might be listed is, for example, anti-VEGF, anti-Thrombomodulin, anti-CD44v6, etc.

Gold standard tissue specimens were obtained from UCLA. Tissue specimens were received from two sources. Cases had been diagnosed using standard procedures including review of hematoxylin and eosin (H&E)-stained slides and the clinical history. Specimen slides were coded and labeled with arbitrary numbers to blind the study pathologists to the historical diagnosis and antibody marker and to protect patient confidentiality.

Specimen slides with tissue sections from cancerous and non-cancerous (control) tissues were used. A total of 175 separate cases were analyzed. Within this set, the diagnoses listed in Table 6 were present with the following frequencies:

TABLE 6

Diagnosis | Number of occurrences
Cancer |
Adenocarcinoma | 25
Large Cell Carcinoma | 18
Mesothelioma | 26
Small Cell Lung Cancer | 20
Squamous Cell Carcinoma | 24
Control |
Emphysema | 34
Granulomatous Disease | 3
Interstitial Lung Disease | 25

c. Determination of the Level of Expression of the Panel of Molecular Markers

Sufficient specimen slides were prepared for each case so that only one probe was tested per slide. In general, a microscope slide is prepared which contains the cytologic sample contacted with one or more labeled probes that are directed at particular molecular markers. Independently, each study pathologist examined an H&E-stained slide to make a diagnosis for each case, and then examined each probe-reacted and immunochemically-stained slide to assess the level of probe binding, recording the results on a standardized data form.

In greater detail, the immunohistochemical staining was performed on formalin fixed, paraffin embedded (FFPE) tissue. Tissue sections were cut 4 microns thick onto poly-L-lysine coated slides and dried at room temperature overnight. De-paraffinization and rehydration of the tissue sections were performed as follows. To completely remove all of the embedding medium from the specimen, the slides were incubated in two consecutive xylene-substitute (Histoclear) baths for five minutes each. All liquid was tapped off the slides before incubation in two consecutive baths of 100% reagent grade alcohol for three minutes each. Once again all excess liquid was tapped off the slides before they were incubated in two final baths of 95% reagent grade alcohol for three minutes each. After the last 95% alcohol bath, the slides were rinsed in tap water and held in wash buffer (Tris-buffered saline wash buffer containing 0.05% Tween 20, corresponding to a 1:10 dilution of DAKO Autostainer Wash buffer, code S3306). Table 7, below, presents a complete list of the reagents used in this study along with corresponding product code numbers. Detection systems used in the study were DAKO EnVision+ HRP mouse (code K4007) or rabbit (code K4003) and LSAB+ HRP (code K0690). The protocols for immunoassaying were followed according to the package inserts. The kits contained liquid two-component DAB+ substrate chromogen (code K3468).
TABLE 7 Reagents used in the Study

Reagents | Code #
National Diagnostics HistoClear | HS-200
Mallinckrodt Reagent Alcohol Absolute | 7019-10
DAKO Antibody Diluent | S809
DAKO Background Reducing Antibody Diluent | S3022
DAKO Autostainer Buffer 10× | S3306
DAKO Target Retrieval Solution | S1700
DAKO Hi pH Target Retrieval Solution | S3307
DAKO Proteinase K | S3020
Rite Aid Hydrogen Peroxide 3% | None
DAKO Protein Block Serum Free | X0909
DAKO Goat Serum | X0501
DAKO Swine Serum | X0901
DAKO EnVision+ Mouse | K4007
DAKO EnVision+ Rabbit | K4003
DAKO LSAB+ | K0690
DAKO DAB+ | K3468
DAKO Hematoxylin | S3302
Dakomount Mounting Media | S3025

Instruments | Serial Numbers
DAKO Autostainers | 3400-6613-03, 3400-6142R-03
Autostainer IHC Software Version | V3.0.2

Pretreatments were critical in optimizing these antibodies on lung tissue. For antibodies requiring enzyme digestion, DAKO Proteinase K (code S3020) was used for 5 minutes at room temperature. Antibodies requiring heat induced target retrieval received pretreatment using either DAKO Target Retrieval Solution (code S1700) or DAKO High pH Target Retrieval Solution (code S3307). Tissues were placed in a pre-heated Target Retrieval Solution and incubated in a 95° C. water bath for 20 or 40 minutes depending on the specific protocol. Tissue sections were then allowed to cool at room temperature for an additional 20 minutes.

After de-paraffinization, rehydration and tissue pretreatment, all specimens were incubated in a solution of 3% hydrogen peroxide to quench endogenous peroxidase activity. Blocking reagents were used specifically for the two antibodies FGF and Telomerase in order to minimize nonspecific background.

As shown in Table 8, below, tissue specimens were incubated for a specified length of time with 200 microliters of the optimally diluted primary antibody. It is noted that the numbering of the markers/antibodies in Table 8 is consistent with the numbering of the antibody probes and markers throughout this document. Slides were then washed in DAKO 1× Autostainer Buffer (code S3306). Depending on the antibody, the correct detection system was applied. The steps and total incubation times for the DAKO EnVision+ HRP and LSAB+ HRP detection systems are shown in Table 9, below. The color reaction is developed using 3,3′-diaminobenzidine (DAB), resulting in a brown precipitate at the site of the reaction.

TABLE 8 Antibodies for Lung Panel

# | Antibody to Marker | Pretreatment | Block | Dilution | Primary Inc | Detection Sys | Clone | Vendor | Code #
1 | VEGF | Hi pH TRS 20 min S3307 | None | 1:15 in S809 | 30 minutes | EnV+ mouse | JH121 | NeoMarkers | MS-350-P
2 | Thrombomodulin | None | None | 1:100 in S809 | 30 minutes | EnV+ mouse | 1009 | DAKO | M0617
3 | CD44v6 | TRS 20 min S1700 | None | RTU | 30 minutes | EnV+ mouse | VFF-7 | NeoMarkers | MS-1093-R7
4 | SP-A | None | None | 1:200 in S809 | 30 minutes | EnV+ mouse | PE10 | DAKO | M4501
5 | Retinoblastoma | TRS 40 min S1700 | None | 1:25 in S809 | 30 minutes | EnV+ mouse | Rb1 | DAKO | M7131
6 | E-Cadherin | TRS 20 min S1700 | None | 1:100 in S809 | 30 minutes | EnV+ mouse | NCH-38 | DAKO | M3612
7 | Cyclin A | TRS 20 min S1700 | None | 1:25 in S809 | 30 minutes | EnV+ mouse | 6E6 | Novocastra | NCL 117205
8 | nm23 | Hi pH TRS 20 min S3307 | None | 1:50 in S809 | 30 minutes | EnV+ rabbit | Polyclonal | DAKO | A0096
9 | Telomerase | TRS 20 min S1700 | Prot Block X0909, 30 min w/5% goat serum X0501 | 1:400 in S809 | Overnight | EnV+ rabbit | Polyclonal | Alpha Diagnostic | EST21-A
10 | Ki-67 | TRS 40 min S1700 | None | 1:200 in S809 | 30 minutes | EnV+ mouse | IVAK-2 | DAKO | M7240
11 | Cyclin D1 | Hi pH TRS 20 min S3307 | None | 1:200 in S3022 | 30 minutes | EnV+ mouse | DCS-6 | DAKO | M7155
12 | PCNA Dilution 1 | TRS 20 min S1700 | None | 1:4000 in S809 | 30 minutes | EnV+ mouse | PC10 | DAKO | M0879
13 | MAGE-1 | Hi pH TRS 20 min S3307 | None | 1:250 in S809 | 30 minutes | EnV+ mouse | MA454 | NeoMarkers | MS 1067
14 | Mucin 1 | TRS 20 min S1700 | None | 1:40 in S809 | 30 minutes | EnV+ mouse | VU4H5 | Santa Cruz Biotech | Sc-7313
15 | SP-B | TRS 20 min S1700 | None | 1:100 in S809 | 30 minutes | EnV+ mouse | SPB02 | NeoMarkers | MS-1300-P1
16 | HERA | TRS 40 min S1700 | None | 1:50 in S809 | 30 minutes | EnV+ mouse | MOC-31 | DAKO | M3525
17 | FGF-2 | None | Prot Block X0909, 30 min w/5% swine serum X0901 | 1:50 in S809 | Overnight | EnV+ mouse | bFM-2 | Upstate Biotech | #05-118
18 | C-Met | Incomplete | None | Incomplete | Incomplete | EnV+ mouse | 8F11 | Novocastra | 118406
19 | TTF-1 | TRS 40 min S1700 | None | 1:25 in S809 | 30 minutes | EnV+ mouse | 8G7G3/1 | DAKO | M3575
20 | Bcl-2 | Hi pH TRS 20 min S3307 | None | 1:75 in S809 | 30 minutes | EnV+ mouse | 124 | DAKO | M0887
21 | p120 | TRS 20 min S1700 | None | 1:10 in S809 | 30 minutes | EnV+ mouse | FB-2 | Biogenex | MU196-UC
22 | N-Cadherin | TRS 40 min S1700 | None | 1:75 in S809 | 30 minutes | EnV+ mouse | 6G4 & 6G11 | DAKO | N/A
23 | EGFR | Prot K 1:25 for 5 min | None | 1:1500 in S809 | 30 minutes | EnV+ mouse | 2-18C9 | DAKO | K1492
24 | Glut 1 | TRS 40 min S1700 | None | 1:200 in S809 | 30 minutes | LSAB+ | Polyclonal | Santa Cruz Biotech | SC 1605
25 | ER-related (p29) | TRS 40 min S1700 | None | 1:200 in S809 | 30 minutes | EnV+ mouse | G3.1 | Biogenex | MU171-UC
26 | MAGE-3 | TRS 40 min S1700 | None | 1:20 in S809 | 30 minutes | EnV+ mouse | 57B | G. Spagnoli | N/A
27 | Glut 3 | TRS 20 min S1700 | None | 1:80 in S809 | 30 minutes | LSAB+ | Polyclonal | Santa Cruz Biotech | SC 7581
28 | PCNA Dilution 2 | TRS 20 min S1700 | None | 1:3200 in S809 | 30 minutes | EnV+ mouse | PC10 | DAKO | M0879

TABLE 9 Detection Systems Used in the Study

Step | Procedure
1 Deparaffinization and rehydration | 2 baths of Histoclear for 5 mins each; 2 baths of 100% alcohol for 3 mins each; 2 baths of 95% alcohol for 3 mins each; water rinse
2 Pretreatments | TRS 40 or 20 mins; High pH TRS 20 mins; Proteinase K for 5 mins; water rinse
3 Peroxidase block | Peroxide bath for 5 mins; water rinse; buffer for 5 mins; Protein Block for 30 mins after H2O2 block
4 Primary Ab | 30 mins or overnight at room temp
5 Detection System | EnV+ Systems: Labelled Polymer, 30 mins; OR LSAB+ System: Secondary Reagent (Secondary Ab link), 15 mins, then Tertiary Reagent (SA-HRP), 15 mins
6 Chromogen | EnV+ Systems: DAB+, 10 mins; LSAB+ System: DAB+, 5 mins

Following immunostaining, all slides were incubated in DAKO Hematoxylin (code S3302) for 3 minutes and coverslipped using Dakomount Mounting Media (S3025). All protocols were run on DAKO Autostainers (serial #'s 3400-6612-03 & 3400-6142R-03) using the IHC software version 3.0.2.

Immunostaining was viewed under a light microscope to determine that controls were correctly stained and tissues were intact. Slides were labeled, boxed and sent to designated pathologists for results interpretation. Trained pathologists identified the type of cancer or other lesion seen in the samples. Trained pathologists assessed the sensitivity to the marker probe by estimating the staining density and proportion of cells stained. These scores were entered in a data sheet for that patient. The pathologists were blinded to the original diagnosis and antibody marker used in the immunostaining. Each slide was read by at least two pathologists and results recorded on a data collection form. To provide additional integrity to the process, the method is repeated with a second or third pathologist. The scores obtained can then be matched to identify data entry errors. The additional data also facilitates a better classifier design.

For each case, up to 27 slides were analyzed, each stained for a marker coded with numbers 1 through 17 and 19 through 28. Staining for marker 18 (C-MET) could not be optimized and the marker/probe was therefore not used. Pathologist 1 scored slides from all 175 cases. Pathologist 2 scored slides from 99 of the cases. Pathologist 3 scored slides from 80 of the cases.

Table 10 below shows how many cases of each diagnosis each pathologist scored slides from:

TABLE 10

Diagnosis | Pathologist 1 | Pathologist 2 | Pathologist 3
Cancer | | |
Adenocarcinoma | 25 | 12 | 14
Large Cell Carcinoma | 18 | 9 | 9
Mesothelioma | 26 | 14 | 8
Small Cell Lung Cancer | 20 | 12 | 6
Squamous Cell Carcinoma | 24 | 13 | 11
Control | | |
Emphysema | 34 | 23 | 13
Granulomatous Disease | 3 | 3 | 2
Interstitial Lung Disease | 25 | 13 | 17

For the purposes of some selected statistical analysis techniques, it was necessary to consider only those cases that had scores for all 27 slides present. Table 11 below shows how many cases of each diagnosis were complete in terms of having scores from all 27 slides.

TABLE 11

Diagnosis | Pathologist 1 | Pathologist 2 | Pathologist 3
Cancer | | |
Adenocarcinoma | 14 | 10 | 8
Large Cell Carcinoma | 12 | 9 | 3
Mesothelioma | 17 | 13 | 3
Small Cell Lung Cancer | 7 | 9 | 1
Squamous Cell Carcinoma | 12 | 13 | 4
Control | | |
Emphysema | 32 | 21 | 1
Granulomatous Disease | 2 | 1 | 0
Interstitial Lung Disease | 23 | 7 | 3

From this table, it can be calculated that each pathologist scored the following total number of complete cases. Pathologist 1 scored all 27 slides for 119 of the cases. Pathologist 2 scored all 27 slides for 83 of the cases. Pathologist 3 scored all 27 slides for 23 of the cases.

The total number of cancer data points is 172. This comprises 113 data points from Pathologist 1 and 60 data points from Pathologist 2. The total number of control data points is 101. This comprises 62 data points from Pathologist 1 and 39 data points from Pathologist 2.

FIG. 3 shows a comparison between H-scores for probes 7 and 15 in control tissue and in cancerous tissue. The x-axis shows the H-scores while the y-axis shows the percent of cases with that particular H-score. The difference in H-scores is apparent.

For each patient the scores were entered electronically into a Pathology Review Form which consolidates the scores into a database showing the patient identifier together with diagnosis, proportion of cells stained, and staining density. The proportions and density were consolidated into a single “H-Score” obtained by grading the intensity as: none=0, weak=1, moderate=2, intense=3, and the percentage of cells as: 0-5%=0, 6-25%=1, 26-50%=2, 51-75%=3, >75%=4, and then multiplying the two grades together. For example, 50% weakly stained plus 50% moderately stained would score 6=2×1+2×2. This is the standard scoring system throughout the analysis, except for the section 3(f), below, titled “Effect of Using other (non-H-score) objective scoring parameters”, which investigates alternative scoring systems.
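The grading rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function and variable names are hypothetical, and the treatment of fractional percentages that fall between the stated bins (for example 5.5%) is an assumption, since the text gives whole-number bin edges only.

```python
# Intensity grades as stated in the text.
INTENSITY_GRADE = {"none": 0, "weak": 1, "moderate": 2, "intense": 3}


def percentage_grade(percent):
    """Map the percentage of stained cells to the 0-4 grade.

    Bins follow the text: 0-5%=0, 6-25%=1, 26-50%=2, 51-75%=3, >75%=4.
    Fractional values are assigned to the lower bin (an assumption).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent <= 5:
        return 0
    if percent <= 25:
        return 1
    if percent <= 50:
        return 2
    if percent <= 75:
        return 3
    return 4


def h_score(staining):
    """Combine per-intensity percentages into one H-score.

    `staining` maps an intensity name to the percentage of cells
    stained at that intensity; the graded products are summed.
    """
    return sum(
        percentage_grade(percent) * INTENSITY_GRADE[intensity]
        for intensity, percent in staining.items()
    )


# Hypothetical slide: 30% weak and 60% intense -> 2*1 + 3*3
print(h_score({"weak": 30, "intense": 60}))
```

Under this grading, a single-intensity slide can score at most 4×3=12.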

Standard classification procedures were used to find the best combination of probes. Typically these use a search procedure such as the “Branch and Bound Algorithm” to find a hierarchy of the best features, ranked according to a test of discriminating power, and truncated according to a test of significance. This process also defines the decision rule or rules for best classification.
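As a rough illustration of searching for a good probe combination, the sketch below uses plain exhaustive enumeration rather than branch and bound; branch and bound reaches the same optimum for a monotonic criterion while pruning most subsets. The function names, the `evaluate` callback, and the toy error model are all hypothetical and are not taken from the study.

```python
from itertools import combinations


def best_panels(probes, evaluate, max_size=5, top=3):
    """Rank all probe subsets of up to `max_size` probes.

    `evaluate(panel)` is assumed to return an estimated error rate
    (lower is better), ideally measured on held-out test data.
    Ties are broken in favor of smaller panels.
    """
    scored = []
    for size in range(1, max_size + 1):
        for panel in combinations(probes, size):
            scored.append((evaluate(panel), len(panel), panel))
    scored.sort()
    return scored[:top]


# Toy error model (purely illustrative): probes 7 and 15 each reduce
# the error, and every extra probe adds a small penalty.
def toy_error(panel):
    return 0.5 - 0.2 * (7 in panel) - 0.15 * (15 in panel) + 0.01 * len(panel)


for error, size, panel in best_panels([7, 15, 20], toy_error, max_size=2):
    print(panel, round(error, 2))
```

With 27 probes and panels of up to five probes, exhaustive enumeration visits roughly 100,000 subsets, which is still tractable; pruning matters more for larger libraries or larger panels.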

The performance of a classifier designed with these features can be estimated from the data used to design the classifier. However, the straightforward application of all the design data to the classifier gives a very unsound estimate of performance, because the classifier is then tested on the same cases to which it was fitted.

The analysis of the data collected in the present example provides the optimum selection of probes, namely the selection giving the best separation of classes. Therefore, panels were obtained that needed only a few probes to perform the analysis. However, the data showed that near-optimum performance could be obtained with other combinations of probes. Hence, the invention is flexible in being adaptable to the availability of probes where cost or supply problems may not allow the very best combination. In some cases, the invention can simply be applied to the available features to find an alternative combination. In other cases, the algorithm used to select features may allow cost weightings to be included in the selection process to arrive at a low cost solution.

The design of the data collection and analysis experiment was chosen to avoid biases through the well established double blind procedures, in which data collection and data analysis were done independently.

In the first case the pathologists reviewed slides with conventional staining to allow a diagnosis to be made. This diagnosis was entered on the Pathology Review form. The pathologists were then presented, in random order, with slides stained by the marker probes for scoring the percentage of cells stained and the relative intensity of the staining. The slides were numbered to exclude information about the probe from the pathologist. To allow data integrity to be checked, two pathologists reviewed all patients.

Data were consolidated into a database that was then reviewed by a team of statisticians. Probes were numbered so that their method of action remained unseen during the analysis of their effectiveness.

The first stage of the analysis was to check the integrity of the data by comparing entries for each patient. Where large differences were found, the data entries were checked and any obvious errors were corrected. Unexplained differences were left in the data.
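The integrity check described above can be sketched as a comparison of two pathologists' H-scores for the same case and probe. The data layout, the function name, and the threshold that defines a "large" difference are all assumptions made for illustration; the text does not state the criterion that was used.

```python
def flag_discrepancies(scores_a, scores_b, threshold=4):
    """List (case, probe, score_a, score_b) entries that differ widely.

    Each argument maps (case_id, probe_number) to an H-score. Only
    slides scored by both pathologists are compared. `threshold` is a
    hypothetical cut-off for a "large" difference.
    """
    flagged = []
    for key in sorted(scores_a.keys() & scores_b.keys()):
        a, b = scores_a[key], scores_b[key]
        if abs(a - b) >= threshold:
            flagged.append((key[0], key[1], a, b))
    return flagged


# Hypothetical entries for two pathologists.
pathologist_1 = {("case-001", 7): 9, ("case-001", 15): 2, ("case-002", 7): 0}
pathologist_2 = {("case-001", 7): 3, ("case-001", 15): 2, ("case-003", 7): 6}

for entry in flag_discrepancies(pathologist_1, pathologist_2):
    print(entry)
```

Flagged entries would then be checked by hand against the original forms, with unexplained differences left in the data as described.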

The data were then separately analyzed by four statisticians, using different techniques in recognition of the fact that different statistical methodologies are suited to different types of discriminating information in the data.

The first step in the process of selecting the best probe combination is to divide the data into two sets, one for designing a classifier and one for testing the performance of the classifier. By selecting the design that was made with the design (training) set but shows the best performance when evaluated on the test set, it can be concluded with confidence that the classifier has generalized to the structure of the data and has not adapted to particular cases seen in the training set.

In order to test for reliability, the analysis was typically repeated with many randomly selected sets of training data and test data. This approach is generally accepted as giving good estimates of the classifier performance. Where these tests showed inconsistent selections of probes, such probe selections were discounted as unreliable.
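The repeated random division into training and test data can be sketched as follows. The classifier here is deliberately trivial (a single threshold on one probe's H-score, placed midway between the class means) and is only a stand-in for the classifiers actually used; it assumes that cancer cases score higher than controls. All names, the split fraction, and the example data are hypothetical.

```python
import random
from statistics import mean


def fit_threshold(train):
    """Return the midpoint between the cancer and control mean scores."""
    cancer = [score for score, is_cancer in train if is_cancer]
    control = [score for score, is_cancer in train if not is_cancer]
    return (mean(cancer) + mean(control)) / 2


def error_rate(threshold, data):
    """Fraction of cases whose thresholded score disagrees with the label."""
    wrong = sum((score > threshold) != is_cancer for score, is_cancer in data)
    return wrong / len(data)


def repeated_holdout(data, n_repeats=100, test_fraction=0.25, seed=0):
    """Average test-set error over many random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Skip splits in which one class is missing from the training set.
        if len({is_cancer for _, is_cancer in train}) < 2:
            continue
        errors.append(error_rate(fit_threshold(train), test))
    return mean(errors)


# Hypothetical (H-score, is_cancer) pairs for a single probe.
cases = [(0, False), (1, False), (2, False), (1, False),
         (9, True), (10, True), (12, True), (8, True)]
print(repeated_holdout(cases))
```

Because the error is always measured on cases that were not used to fit the threshold, the averaged figure avoids the optimistic bias of testing on the design data.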

d. Statistical Analysis and/or Pattern Recognition

1. Introduction to Data Analysis

a. Input Data

i. Raw Data

For each patient the scores were entered electronically into a Pathology Review Form that consolidates the scores into a database showing the patient identifier together with diagnosis, proportion of cells stained, and staining density.

ii. Computed Data

The efficiency of the score for each probe used in the analysis is computed from the intensity/percentage tables. The proportions and density are consolidated into a single “H-Score” with a simple rule: H = proportion stained × (3 if intense + 2 if moderate + 1 if weakly stained). This is the feature value associated with that probe.

iii. Alternative Computed Data Parameters

The H-score described above was heuristically derived. A simple analysis to find a better way of combining percentages and intensity failed to show a significant improvement over H-score (Section 3(f), titled “Effect of Using other (non-H-score) objective scoring parameters”). A larger database may allow the extraction of a better rule in future.

iv. User Supplied Weighting Criteria Per Marker

The invention is flexible in being adaptable to the availability of features where cost or supply problems may not allow the very best combination. For example, the invention can simply be applied to the available features to find an alternative combination. Alternatively, the algorithm used to select features allows cost weightings to be included in the selection process to arrive at a minimum cost solution. Marker performance estimates are shown for combinations selected from all the markers collected or only those from one supplier. It is also shown how the C4.5 package can be used to down-weight certain probes, say on the basis of their high cost. These probe combinations do not perform as well as the optimum combination, but the performance might be acceptable in circumstances where cost is a significant factor.

v. User Supplied Weighting Criteria Per Class

Some of the methods used allow weightings to be applied to the classes. This is available in C4.5, where the tree design can optimize the cost. Also, the Discriminant Function method gives a single parameter output which can be thresholded to give a desired false positive or false negative probability. A plot of these probabilities for different threshold settings is known as the Receiver Operating Characteristic (ROC) curve.

vi. Detection Panels—Assumptions

A low probability of false negatives was assumed to be desirable for the cancer detection process (to avoid positive patients being missed, at the cost of an increased number of false positives who would require re-screening). It was also assumed that the cancer discrimination process would require a lower false positive score (to minimize patients receiving the wrong treatment).

It was assumed that detection panels requiring 6 or more probes to achieve an acceptable performance would not be cost effective. It was also assumed that a detection panel with a false negative error rate of more than 5% would not be acceptable. Panels falling outside this box are not accepted. This assumption acknowledges that cytometric panels are likely to have a worse performance than the histology-based panels analyzed here. The ultimate aim will be a cytometric panel that achieves an error rate better than 20%, this being approximately the performance of cervical PAP smear screeners.

vii. Discrimination Panels—Assumptions

It was assumed that panels requiring 6 or more probes are not cost effective, and it was assumed that an error rate of better than 20% is required. Panels falling outside this box were not accepted.

b. Output Data

Outputs provided by the present analysis included:

Confusion Matrices, showing how data from the test set was classified as either true positive, false positive, true negative or false negative. These may be shown as actual counts or as percentages. Confusion matrices are discussed in section 2(d) titled “Performance Metrics”. An exemplary confusion matrix, obtained from data analyzed by decision trees, is shown below in Table 12 for simultaneous discrimination of adenocarcinoma, squamous cell carcinoma, large cell carcinoma, mesothelioma and small cell carcinoma.

TABLE 12
                 Adeno     Squamous Cell   Large Cell   Mesothelioma   Small Cell
Adeno            67.74%    6.45%           19.40%       0.00%          6.45%
Squamous Cell    2.94%     76.47%          11.67%       0.00%          8.82%
Large Cell       28.00%    8.00%           44.00%       8.00%          12.00%
Mesothelioma     0.00%     25.64%          51.28%       89.74%         2.56%
Small Cell       0.00%     3.85%           23.08%       3.85%          69.23%

-   Error Rates, summarizing data in the confusion matrix as the sum of all false classifications divided by the total number of classifications made, expressed as a percentage.
-   Receiver Operating Characteristic (ROC) curves, showing the estimated percentage (or per unit probability) of false positive and false negative scores for different threshold levels in the classifier. An indifferent classifier, unable to discriminate better than random choice, would present a ROC curve with equal true and false readings. The area under this curve would be 50% (0.5 probability).
-   Area Under the Curve (AUC), often used as an overall estimate of classifier performance; most standard discriminant function packages provide this AUC figure. A perfect classifier would have 100% Area Under the Curve, and a useless classifier would have an AUC near 50% (0.5).
-   Sensitivity and specificity (can be derived from the confusion matrix). See section 2(d)(iii) titled “Sensitivity and Specificity”.
-   Marker correlation matrices. See FIG. 4.
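The performance outputs listed above can be illustrated with a short Python sketch. It assumes a two-class confusion matrix given as raw counts and computes the AUC by its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case). The function names and all numbers are hypothetical and are used only to show the arithmetic.

```python
def binary_metrics(tp, fn, fp, tn):
    """Error rate, sensitivity and specificity from two-class confusion counts."""
    total = tp + fn + fp + tn
    return {
        "error_rate": (fp + fn) / total,   # all false classifications / all classifications
        "sensitivity": tp / (tp + fn),     # true positives / all diseased cases
        "specificity": tn / (tn + fp),     # true negatives / all disease-free cases
    }


def auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=102, fn=14, fp=0, tn=86)
print(metrics)

# Hypothetical classifier outputs for two positive and two negative cases.
print(auc([0.9, 0.8], [0.1, 0.8]))  # 0.875
```

A classifier that ranks every positive above every negative gives an AUC of 1.0 under this definition, and one that ranks at random gives a value near 0.5, in line with the description above.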

i. Detection Panels: Composition

These panels are trained on data divided into two classes: patients with any of the five cancers and patients with none of the cancers. Not all probes were present for all patients. Where one or more probes were missing for a particular analysis, these cases were excised from the data. Hence, where analysis was undertaken on reduced numbers of probes, the data set might include slightly more cases.

The number of probes included in the analysis was 27. In many cases, however, a false probe was added, for which the data entered were drawn from a random number generator set to generate numbers uniformly between zero and 12. This false probe was included in much of the early analysis to ensure integrity in the probe selection process. This false probe was also used in one approach to progressively eliminate probes from the analysis. Probes that contributed less information than the false probe could be readily identified and excluded from the selection process. Early elimination of such probes speeds the analysis and renders the analysis less vulnerable to variations in results (noise) caused by these probes.

ii. Detection Panel Performance

As outputs from this study, the probe combinations selected by the different methodologies and their performance estimates in terms of the confusion matrix, % error rate, and AUC are reported.

iii. Detection Panels—Alternative Compositions

Detection panels were also selected from reduced sets of probes. In one set of panels, performance measures of panels weighted for commercially preferred markers were obtained. The performance obtained when the best probe was removed from the analysis to find a new combination of discriminating probes was also analyzed. The performance of a single probe acting on its own was found to be very high (probe 7). However, as shown below in the performance diagrams of Table 13, evaluated using linear discriminant analysis, the performance was improved as more markers were added. The best subsets of probes were determined using best subsets logistic regression. The improvement is statistically significant.

TABLE 13
Probes                                 Cancer    Control
Probe 7                     Cancer     87.93%    12.07%
                            Control    0.00%     100.00%
Probes 7 and 16             Cancer     93.10%    6.90%
                            Control    1.16%     98.84%
Probes 7, 15 and 16         Cancer     90.52%    9.48%
                            Control    1.16%     98.84%
Probes 1, 7, 15, and 16     Cancer     90.52%    9.48%
                            Control    0.00%     100.00%
Probes 1, 4, 7, 15, and 16  Cancer     92.24%    7.76%
                            Control    1.16%     98.84%

The best and second-best subsets of probes (determined using best subsets logistic regression and evaluated using logistic regression) are shown below in Table 14. AUC = area under the ROC curve. The mean AUC is the average from 100 trials on random train and test partitions (70%:30%).

TABLE 14
Probes              Mean AUC
7                   94.28
28                  80.14
7, 16               95
7, 15               94.59
7, 15, 16           95.94
1, 7, 16            95.33
1, 7, 15, 16        95.61
4, 7, 15, 16        95.34
1, 4, 7, 15, 16     95.3
1, 7, 11, 15, 16    95.57

iv. Discrimination Panels—Composition

For this part of the study, five classifiers were designed and tested, each designed to detect the presence of one of the cancers among all patients with cancer. The application of this five-way pair-wise system allows doubtful cases to appear more than once in the analysis, or not at all. Such cases can be identified and subjected to closer scrutiny, re-testing or alternative testing regimes.

Again, the number of probes in the study was 27, with a false probe used in the early stage to reduce the numbers in the analysis.

v. Discriminant Panels—Performance

The performance estimators described above were used to show the performance of the best probe combinations discovered by the different techniques.

vi. Discriminant Panels—Alternative Composition

The analysis was repeated for a probe combination comprising commercially preferred probes. Performance was degraded, but not unusable, for several reduced-set classifiers. The best subsets of probes without probe 7 (determined using best subsets logistic regression) are shown below in Table 15. The data were evaluated using linear discriminant analysis.

TABLE 15
Probes                                   Cancer      Control
Probe 28                      Cancer     0.706897    0.293103
                              Control    0.093023    0.906977
Probes 10 and 28              Cancer     0.793103    0.206897
                              Control    0.034884    0.965116
Probes 10, 15 and 28          Cancer     0.810345    0.189655
                              Control    0.011628    0.988372
Probes 1, 10, 15 and 28       Cancer     0.827586    0.172414
                              Control    0.011628    0.988372
Probes 1, 10, 15, 16 and 28   Cancer     0.827586    0.172414
                              Control    0.011628    0.988372

The best and second-best subsets of probes without probe 7 (determined using best subsets logistic regression and evaluated using logistic regression) are shown below in Table 16. AUC = area under the ROC curve. The mean AUC is the average from 100 trials on random train and test partitions (70%:30%).

TABLE 16
Probes               Mean AUC
28                   79.36%
10                   82.28%
10, 28               94.21%
15, 28               88.68%
10, 15, 28           92.90%
1, 10, 28            93.59%
1, 10, 15, 28        92.99%
8, 10, 15, 28        93.20%
1, 10, 15, 16, 28    93.13%
1, 8, 10, 15, 28     93.57%

2. Data Analysis Methodology

In this section, the process of gaining an initial understanding of the structure of the data, as a guide to interpreting results from the different methodologies used, is described.

a. Analysis of Variance

i. Pathologist-to-Pathologist Variability and Pooling Pathologist Scores

(1) t—Test

Two pathologists reviewed each patient's slides in this clinical trial. Pathologist 1 reviewed all patients, Pathologist 2 also reviewed approximately half of this set, and Pathologist 3 reviewed the remainder. With two independent estimates of the H-score, the consistency of pathologist performance could be tested.

A readily available statistical tool was used to test the variability between pathologists. This is the paired-sample t-test. This takes the difference between each pair of estimates, averages these and expresses this as a proportion of the overall variances. The t-test then converts this ratio into a probability estimating the likelihood that the two sample sets came from the same population (the P value).
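As an illustration of the paired-sample test described above, the following Python sketch computes the usual paired t statistic: the mean of the pairwise differences divided by its standard error, with n-1 degrees of freedom. The function name is hypothetical. Converting the statistic to a P value requires the Student t distribution, which is not in the Python standard library, so that step is left to a statistics package.

```python
import math


def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired observations."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / math.sqrt(var_diff / n)
    return t_stat, n - 1


# Hypothetical H-scores for one probe from two readers of the same six cases.
reader_1 = [6, 8, 4, 12, 9, 3]
reader_2 = [4, 8, 3, 9, 9, 2]
t_stat, dof = paired_t(reader_1, reader_2)
print(t_stat, dof)
```

A large absolute t value for a probe corresponds to a small P value, that is, to a systematic difference between the two readers for that probe.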

This test was applied to the scores for each marker probe, for all cases reviewed by Pathologist 1 and Pathologist 2, and also for all cases reviewed by Pathologist 1 and Pathologist 3. Since there were 27 tests applied (to cover all probes), a low value of P=0.01 was selected as the “significance threshold”. Results, showing the P scores for each probe, and for the two pairs of pathologists, are shown below in Tables 17, 18, 19 and 20. It is clear that Pathologist 1 and Pathologist 2 were more consistent than Pathologist 1 and Pathologist 3.

TABLE 17 Pathologist 1, Pathologist 2 scores:
X1           X2           X3           X4           X5           X6           X7
0.5875446    0.01051847   0.4659704    0.4659704    0.3772894    0.2307273    0.01001357
X8           X9           X10          X11          X12          X13          X14
0.004131056  0.7703014    0.1640003    0.2374452    0.9580652    0.1587876    0.001200265
X15          X16          X17          X18          X19          X20          X21
0.19742      0.3860899    0.3829022    NA           0.544601     0.08873848   0.1686243
X22          X23          X24          X25          X26          X27          X28
0.5428451    0.1912477    0.4031977    0.2477236    0.5673386    0.9174037    0.00339071

TABLE 18 Pathologist 1, Pathologist 2 scores thresholded at 0.01 (α = 1% level of significance):
X1     X2     X3     X4     X5     X6     X7
TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
X8     X9     X10    X11    X12    X13    X14
FALSE  TRUE   TRUE   TRUE   TRUE   TRUE   FALSE
X15    X16    X17    X18    X19    X20    X21
TRUE   TRUE   TRUE   NA     TRUE   TRUE   TRUE
X22    X23    X24    X25    X26    X27    X28
TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   FALSE

TABLE 19 Pathologist 1, Pathologist 3 scores:
X1  X2  X3  X4  X5  X6  X7
3.814506e−09  0.0399131  0.1954867  5.671062e−05  0.01856276  0.2757166  0.2292583
X8  X9  X10  X11  X12  X13  X14
2.044038e−12  0.004166467  0.00983267  0.003710155  0.01461007  0.03312421  0.0003367823
X15  X16  X17  X18  X19  X20  X21
0.0005162036  0.2276537  0.002987705  4.267708e−06  0.007287372  0.1654067
X22  X23  X24  X25  X26  X27  X28
0.02400127  0.0009497766  2.478456e−07  0.1591684  0.08318303  3.122143e−05  1

TABLE 20 Pathologist 1, Pathologist 3 scores thresholded at 0.01 (α = 1% level of significance):
X1     X2     X3     X4     X5     X6     X7
FALSE  TRUE   FALSE  FALSE  TRUE   TRUE   TRUE
X8     X9     X10    X11    X12    X13    X14
FALSE  FALSE  FALSE  FALSE  TRUE   TRUE   FALSE
X15    X16    X17    X18    X19    X20    X21
FALSE  TRUE   FALSE  FALSE  FALSE  FALSE  TRUE
X22    X23    X24    X25    X26    X27    X28
TRUE   FALSE  FALSE  TRUE   TRUE   FALSE  TRUE

Because the H-score is subjective, it is prone to scale factor differences and noise at marginal cases. So, in spite of the three features which showed statistically different scores between Pathologist 1 and Pathologist 2, this joint data was accepted as representative of a measuring instrument. Pathologist 1 and Pathologist 2 were combined into a single data set for the analysis process. The results for Pathologist 3 were withheld for independent testing purposes. Such tests using the Pathologist 3 data would be biased towards showing an under-performance because of the significant differences.

The data from Pathologist 1 and Pathologist 2 were combined by considering them as separate cases, with the variability giving a degree of independence between the results for any one case. When testing with such data, the performance estimates will be biased towards a more optimistic value. This is because samples coming from the same patient may occur simultaneously in the training and test subsets. This does not, however, invalidate the processes used to find the best combination of features; it merely biases the estimate of performance.

(2) Analysis of Variance of H-Scores

(a) Background

Within each probe, the H-scores may vary for many reasons. To the extent that they vary consistently with the type of disease, this is useful; variation due to which pathologist read the slide is instructive; and random variation sets a limit on the detection of the previous two sources of variation.

Analysis of Variance (ANOVA) is a standard technique for splitting up the sources of variation in data and for testing its statistical significance. ANOVA summarizes the total variation of a set of data as a sum of terms which can be attributed to specific sources, or causes, of variation.

ANOVA is available in many statistical packages. The public domain package “R” was chosen (“The R Project for Statistical Computing”, http://www.R-project.org/).

(b) Aim

The aim was to perform ANOVA analyses on the H-score data from Pathologists 1 and 2 and to consider whether this data can be safely merged into a single consistent set for further analysis for the selection of panels.

(c) Methodology

From the database, data was selected from Pathologists 1 and 2. Only data which was complete for a given probe was used in the ANOVA for that probe.

The control categories of Emphysema, Granulomatous Disease, and Interstitial Lung Disease were grouped together and called “Normal”, giving 6 levels within factor Disease.

Pathologist was coded as a factor with 2 levels (Pathologist 1, Pathologist 2).

An R script was written to perform a standard ANOVA analysis for each probe in turn, using the factors Disease and Pathologist and the interaction term Disease:Pathologist. The results are shown below in Table 21. “Df” is the degrees of freedom: in a dataset of n observations, once n-1 deviations from the mean are known, the nth is automatically determined, so n-1 is the number of degrees of freedom. Sum Sq and Mean Sq are measures of variation. F is a test statistic concerning the equality of two variances, based on the F distribution. Pr(>F) is the probability used to determine whether or not the variability is statistically significant.

TABLE 21 Analysis of Variance of H-Scores
                      Df   Sum Sq   Mean Sq  F value  Pr(>F)
Probe1
Disease               5    443.56   88.71    15.8202  3.690e−13***
Pathologist           1    0.66     0.66     0.1174   0.7323
Disease:Pathologist   5    15.34    3.07     0.5470   0.7405
Residuals             204  1143.93  5.61
Probe2
Disease               5    1067.39  213.48   24.1234  <2e−16***
Pathologist           1    13.02    13.02    1.4709   0.2263
Disease:Pathologist   5    27.98    5.60     0.6324   0.6752
Residuals             249  2203.50  8.85
Probe3
Disease               5    1098.49  219.70   21.0751  <2e−16***
Pathologist           1    6.73     6.73     0.6458   0.4224
Disease:Pathologist   5    29.72    5.94     0.5703   0.7227
Residuals             243  2533.16  10.42
Probe4
Disease               5    631.8    126.4    9.3707   3.454e−08***
Pathologist           1    6.6      6.6      0.4869   0.4860
Disease:Pathologist   5    13.1     2.6      0.1939   0.9647
Residuals             246  3317.1   13.5
Probe5
Disease               5    754.30   150.86   25.2826  <2e−16***
Pathologist           1    14.25    14.25    2.3875   0.1236
Disease:Pathologist   5    7.54     1.51     0.2528   0.9381
Residuals             248  1479.80  5.97
Probe6
Disease               5    721.91   144.38   11.8515  2.771e−10***
Pathologist           1    1.91     1.91     0.1568   0.6925
Disease:Pathologist   5    47.82    9.56     0.7850   0.5613
Residuals             246  2996.93  12.18
Probe7
Disease               5    1171.47  234.29   77.6802  <2e−16***
Pathologist           1    8.84     8.84     2.9294   0.08847
Disease:Pathologist   5    46.36    9.27     3.0742   0.01063*
Residuals             209  630.37   3.02
Probe8
Disease               5    209.82   41.96    6.4352   1.201e−05***
Pathologist           1    12.66    12.66    1.9407   0.16483
Disease:Pathologist   5    71.20    14.24    2.1838   0.05654
Residuals             251  1636.76  6.52
Probe9
Disease               5    197.21   39.44    8.4348   2.015e−07***
Pathologist           1    7.33     7.33     1.5681   0.2116
Disease:Pathologist   5    24.56    4.91     1.0505   0.3884
Residuals             265  1239.17  4.68
Probe10
Disease               5    1113.46  222.69   39.0730  <2e−16***
Pathologist           1    1.01     1.01     0.1778   0.67371
Disease:Pathologist   5    62.45    12.49    2.1916   0.05635
Residuals             213  1213.96  5.70
Probe11
Disease               5    320.15   64.03    9.5553   2.416e−08***
Pathologist           1    1.28     1.28     0.1918   0.6618
Disease:Pathologist   5    10.04    2.01     0.2996   0.9128
Residuals             245  1641.76  6.70
Probe12
Disease               5    832.26   166.45   27.8793  <2e−16***
Pathologist           1    0.18     0.18     0.0307   0.8610
Disease:Pathologist   5    15.16    3.03     0.5079   0.7701
Residuals             248  1480.68  5.97
Probe13
Disease               5    46.594   9.319    7.8408   8.674e−07***
Pathologist           1    0.044    0.044    0.0368   0.8481
Disease:Pathologist   5    10.143   2.029    1.7069   0.1343
Residuals             210  249.584  1.188
Probe14
Disease               5    1305.69  261.14   23.9460  <2e−16***
Pathologist           1    28.66    28.66    2.6279   0.10630
Disease:Pathologist   5    142.90   28.58    2.6208   0.02492*
Residuals             243  2649.98  10.91
Probe15
Disease               5    401.02   80.20    21.268   <2e−16***
Pathologist           1    13.17    13.17    3.493    0.0630
Disease:Pathologist   5    6.17     1.23     0.327    0.8963
Residuals             214  807.02   3.77
Probe16
Disease               5    2520.26  504.05   65.5572  <2e−16***
Pathologist           1    0.15     0.15     0.0194   0.8892
Disease:Pathologist   5    24.29    4.86     0.6318   0.6757
Residuals             247  1899.12  7.69
Probe17
Disease               5    530.64   106.13   13.0178  2.426e−11***
Pathologist           1    8.42     8.42     1.0325   0.31050
Disease:Pathologist   5    109.96   21.99    2.6975   0.02131*
Residuals             266  2168.55  8.15
Probe19
Disease               5    1670.86  334.17   29.1960  <2e−16***
Pathologist           1    2.17     2.17     0.1895   0.6637
Disease:Pathologist   5    32.61    6.52     0.5698   0.7231
Residuals             248  2838.56  11.45
Probe20
Disease               5    964.71   192.94   34.2760  <2e−16***
Pathologist           1    8.83     8.83     1.5687   0.2116
Disease:Pathologist   5    19.60    3.92     0.6963   0.6267
Residuals             245  1379.12  5.63
Probe21
Disease               5    6.927    1.385    2.0604   0.07076
Pathologist           1    0.464    0.464    0.6906   0.40670
Disease:Pathologist   5    1.576    0.315    0.4687   0.79945
Residuals             263  176.830  0.672
Probe22
Disease               5    640.16   128.03   31.7250  <2e−16***
Pathologist           1    1.64     1.64     0.4058   0.5247
Disease:Pathologist   5    18.78    3.76     0.9305   0.4617
Residuals             247  996.81   4.04
Probe23
Disease               5    1915.62  383.12   46.5565  <2e−16***
Pathologist           1    10.77    10.77    1.3092   0.2537
Disease:Pathologist   5    20.92    4.18     0.5084   0.7698
Residuals             246  2024.39  8.23
Probe24
Disease               5    516.06   103.21   24.0786  <2e−16***
Pathologist           1    9.52     9.52     2.2210   0.1376
Disease:Pathologist   5    12.48    2.50     0.5823   0.7135
Residuals             216  925.87   4.29
Probe25
Disease               5    1761.26  352.25   34.5245  <2e−16***
Pathologist           1    11.51    11.51    1.1285   0.2891
Disease:Pathologist   5    41.49    8.30     0.8134   0.5411
Residuals             248  2530.33  10.20
Probe26
Disease               5    399.85   79.97    13.6548  1.428e−11***
Pathologist           1    0.30     0.30     0.0517   0.8204
Disease:Pathologist   5    14.81    2.96     0.5056   0.7719
Residuals             214  1253.31  5.86
Probe27
Disease               5    117.92   23.58    6.2551   1.956e−05***
Pathologist           1    0.64     0.64     0.1695   0.6810
Disease:Pathologist   5    25.52    5.10     1.3539   0.2431
Residuals             212  799.31   3.77
Probe28
Disease               5    1634.60  326.92   38.171   <2e−16***
Pathologist           1    8.40     8.40     0.981    0.3229
Disease:Pathologist   5    16.15    3.23     0.377    0.8643
Residuals             267  2286.76  8.56
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
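The relationship between the columns of Table 21 can be checked with a little arithmetic: each Mean Sq is the Sum Sq divided by its Df, and each F value is the Mean Sq of the term divided by the Mean Sq of the residuals. The Python sketch below applies this to the Disease row of Probe1. The function name is hypothetical, and the result agrees with the table only up to the rounding of the published figures.

```python
def anova_row(sum_sq, df, resid_sum_sq, resid_df):
    """Return (mean square, F value) for one ANOVA term against the residuals."""
    mean_sq = sum_sq / df
    resid_mean_sq = resid_sum_sq / resid_df
    return mean_sq, mean_sq / resid_mean_sq


# Probe1, Disease row and Residuals row, as listed in Table 21.
mean_sq, f_value = anova_row(sum_sq=443.56, df=5, resid_sum_sq=1143.93, resid_df=204)
print(round(mean_sq, 2), round(f_value, 2))  # 88.71 15.82
```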

(d) Analysis of Results

In all cases (except for probe 21), the response of the probes was related to disease. This is not surprising since the probes have presumably been selected for this purpose. In no case is the response of the probe related to pathologist (at the p=0.05 level). This indicates that it would be safe to merge this data and use the two pathologists as two measurements on the data.

In a few cases (probes 7, 14, and 17), there is some evidence of an interaction term gaining significance. This indicates that there may be some difference between pathologists in their scoring of some diseases. Some of these cases may well be due to an occasional outlier in the data.

(e) Conclusions

The results indicate that it is safe to merge this data for further analysis. The data indicate that the slight interactions in some cases between pathologist and disease appear to be attributable to random sources.

ii. Patient to Patient Variability

The variability from patient to patient was measured by the disease:disease variability of section 2(a)(i)(2) (see above, “Analysis of Variance of H-Scores”).

iii. Marker-to-Marker Variability

Histograms were plotted (PathologistData.xls, worksheet: Histograms) showing the distribution of marker scores for each probe for Control vs. Cancer.

b. Marker Correlation Matrix Analyses

The population correlation coefficient (“Applied Multivariate Statistical Analysis”, R. A. Johnson and D. W. Wichern, 2nd Ed, 1988, Prentice-Hall, N.J.) measures the amount of linear association between a pair of random variables. Typically the distributions and associated parameters of the random variables are not known and the population correlation coefficient cannot be directly computed. In this case it is possible to compute the sample correlation coefficient from sample data. See FIG. 4. The sample correlation coefficient is, however, only an estimate of the population correlation coefficient. Moreover, because it is calculated on the basis of sample data, it is possible, purely by chance, that it may indicate a strong positive or negative correlation when in reality there may be no actual relationship between the corresponding random variables (“Modern Elementary Statistics”, J. E. Freund, 6th Ed, 1984, Prentice-Hall, N.J.).

The correlation coefficient measures the ability of one variable to predict the other. A strong linear association does not, however, imply a causal relationship. The square of the correlation coefficient is called the coefficient of determination. The coefficient of determination computed for a bivariate data set measures the proportion of the variability in one variable that can be accounted for by its linear relationship to the other. When dealing with several variables, the correlation coefficient can be calculated for each pair in turn and the set of coefficients can be written as a matrix called the correlation matrix. See FIG. 4.
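The sample correlation coefficient, the coefficient of determination, and the correlation matrix described above can be sketched in Python as follows. The function names and the marker values are hypothetical and serve only to show the calculation; this is not the procedure used to produce FIG. 4.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between two equal-length lists.

    Raises ZeroDivisionError if either variable has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping marker name -> list of scores."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical H-scores for three markers over five cases.
scores = {
    "X7": [0, 2, 4, 8, 12],
    "X16": [1, 3, 5, 9, 12],
    "X18": [5, 0, 11, 3, 7],
}
matrix = correlation_matrix(scores)
r = matrix["X7"]["X16"]
print(r, r ** 2)  # correlation coefficient and coefficient of determination
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.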

The H-scores for the individual markers can be modeled as random variables. The sample correlation matrix for this multivariate data set can be computed from the input data described in the section titled “Input Data”, above.

c. Pattern Recognition

Statistical pattern recognition is an approach to classifying signals or geometric objects on the basis of quantitative measurements (called features). Statistical pattern recognition essentially reduces to the problem of dividing the n-dimensional feature space into regions that correspond to the categories or classes of interest.

Three different classifier methodologies employed in this study are sensitive to different structural forms within the data.

For the Decision Tree method, a preliminary analysis of different data combinations identified markers which were never used by C4.5 for the detection panel. These were removed from the analysis, and this resulted in more consistent results, symptomatic of the left-out probes only contributing noise to the selection process.

Similarly, a preliminary analysis of probes used in the detection panels identified the noisy probes for removal prior to the detailed analysis.

The Linear Discriminant Function method in SPSS has built-in stepwise processes for reducing the numbers of markers in the analysis. Typically, this reduced the probes used in the analysis to between 2 and 7.

The Logistic Regression methods in R and SAS implement stepwise procedures for variable selection. In SAS, a best subsets variable selection option is also provided. In R, the stepwise methodology was used in conjunction with multiple random trials to develop a heuristic method for selecting variables based on the number of times a given feature was used in 100 random selections of training and test data (split 70%:30% respectively). Features with counts comparable to the count for the artificial random feature were progressively eliminated until a minimal consistent set of features was obtained over 100 runs.

i. Statistical Methods

From the point of view of multivariate statistical analysis, the problem is one of estimating density functions in high-dimensional space (and partitioning this space into the regions of interest). Assuming that the distributions of random (feature) vectors are known, the theoretically best classifier is the Bayes classifier because it minimizes the probability of classification error (K. Fukunaga, “Statistical Pattern Recognition”, 2nd Ed., Academic Press 1990, p. 3). Unfortunately, the implementation of the Bayes classifier is difficult because of its complexity, especially when the dimensionality of the feature space is high. In practice, simpler parametric classifiers are used. Parametric classifiers are based on assumptions about the underlying density or discriminant functions. The most common such classifiers are linear and quadratic classifiers. In multivariate statistical analysis such classifiers fall under the heading of discriminant analysis. Discriminant analysis techniques are closely related to multivariate linear regression models and generalized linear models (encompassing logistic and multinomial regression).

(1) Logistic Regression with a Binomial Response

(a) Background

The problem of selecting a set of markers to be used on a detection panel can be formulated as a logistic regression problem with a binomial response. The response variable is a factor with two levels: normal (no cancer) and abnormal (cancer). The explanatory variables are the marker H-scores.

The problem of selecting a set of markers to be used on a cancer discrimination panel can also be formulated as a logistic regression problem with a binomial response. The response variable is a factor with two levels: normal (not the cancer of interest) and abnormal (cancer of interest). The explanatory variables are the marker H-scores.

Stepwise variable selection can be used to select a subset of the original variables (markers) for use in discriminating between the two classes. This is a computationally expensive exercise and is best suited to a computer. Several commercial and public domain software packages (e.g., R, S-PLUS, and SAS) implement stepwise logistic regression.

Two different approaches to feature selection were investigated, based on the stepwise variable selection procedures found in R and SAS respectively.

(b) Experimental Data

The data used for the present analysis consists of the H-scores for markers 1-17 and 19-28 for the cases examined by Pathologist 1 and Pathologist 2 and described elsewhere in this report. In addition, a dummy marker, 18, was added to the data set. The dummy marker consists of integer values from 0 to 12 selected at random from a uniform distribution.

(c) Method 1: Using the R Package (Version 1.4.1)

Computerized model fitting procedures generally cannot deal with missing data. This is the case for the glm (generalized linear model) procedure used in R. Consequently, when fitting a model using glm, it was necessary to exclude all the cases for which there are one or more missing values. When fitting the initial full model, containing the 27 real markers and the single dummy marker, this reduces the data set to only 202 cases. With so few observations, it was decided that the best way to perform variable selection, to train a classifier using the selected variables, and to assess its performance was to undertake 100 trials on random partitions of the data into train and test sets.

(i) Partitioning the Data into Train and Test Sets

At the start of each trial, the data is partitioned into a test set and a training set. This is done by randomly choosing 30% of the abnormals and 30% of the normals to form the test set, and using the remaining observations to form the training set.
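The partitioning step above amounts to a stratified random split. A minimal Python sketch is given below, assuming cases are identified by simple labels; the function name, the rounding of the 30% share to a whole number of cases, and the use of a seed for repeatability are all assumptions made for the illustration and are not taken from the R code used in the study.

```python
import random


def stratified_split(cases, labels, test_fraction=0.30, seed=None):
    """Split cases into (train, test), drawing test_fraction of each class for the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [case for case, label in zip(cases, labels) if label == cls]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))  # assumed rounding rule
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical data set: 10 abnormal and 10 normal cases, identified by index.
cases = list(range(20))
labels = ["abnormal"] * 10 + ["normal"] * 10
train, test = stratified_split(cases, labels, seed=1)
print(len(train), len(test))  # 14 6
```

Repeating the call with a different seed for each of the 100 trials would give 100 different random partitions with the same class proportions.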

(ii) Variable (Marker) Selection

At the start of each trial, the full model, which includes all of the variables (markers), is fitted to the training data. In R the logistic regression model is fitted using glm. The code fragment used is as follows:

-   my.model <- Class ~ X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15+X16+X17+X18+X19+X20+X21+X22+X23+X24+X25+X26+X27+X28
-   my.glm <- glm(my.model, family=binomial(link=logit), data=training.data)

The procedure stepAIC is then used to perform stepwise variable selection based on the Akaike Information Criterion (AIC). This procedure is part of the publicly available MASS library. The library and the procedure are described in “Modern Applied Statistics with S-PLUS” (W. N. Venables and B. D. Ripley, Springer-Verlag, New York, 1999). The R code fragment to do this is as follows:

-   my.step <- stepAIC(my.glm, direction="both")

The resulting model is then assessed on the test data. The code fragment used is as follows:

-   probability_is_abnormal <- predict(my.step, testing.data, type="response")

The performance of the classifier is recorded in terms of the actual error rate of misclassification (AER) and the area under the ROC curve (AUC). After the 100 trials, 100 models and their associated AERs and AUCs are available. A frequency table is constructed, recording the number of times each variable made an appearance in the 100 models. An example is shown in Table 22:

TABLE 22

This table is used to decide which markers to discard. First, all of the markers that have a frequency less than or equal to 10 are discarded. Next, a cut-off frequency is chosen based on the frequency of the dummy marker (typically this is 1 or 1.5 times that of the dummy marker). All markers with a frequency less than this cut-off value are discarded. The remaining markers, along with the dummy marker, are then used as the full model for another 100 trials and the pruning process is repeated. If necessary, the severity of the pruning can be increased to force one or more markers out of the model. If necessary, the remaining markers can be used as the full model for yet another 100 trials. Pruning stops when the desired number of panel members is reached or the average AUC for the current model is less than that for the preceding model.
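One round of the pruning rule described above can be sketched in Python as follows. The sketch assumes that the 100 fitted models are available as lists of marker names; the function name, the parameter names, and the example frequencies are hypothetical and do not reproduce the contents of Table 22.

```python
from collections import Counter


def prune_markers(models, dummy="X18", min_count=10, dummy_multiplier=1.0):
    """Return (markers kept, appearance counts) after one pruning round.

    A marker is kept if it appears in more than min_count models and at least
    as often as dummy_multiplier times the dummy marker.
    """
    counts = Counter(marker for model in models for marker in set(model))
    cutoff = dummy_multiplier * counts[dummy]
    kept = sorted(
        marker
        for marker, count in counts.items()
        if marker != dummy and count > min_count and count >= cutoff
    )
    return kept, counts


# Hypothetical outcome of 100 stepwise selections.
models = [["X7", "X16"]] * 60 + [["X7", "X18"]] * 30 + [["X7", "X3", "X16"]] * 10
kept, counts = prune_markers(models)
print(kept)    # markers surviving this round
print(counts)  # frequency table for the round
```

In the procedure described above, the surviving markers together with the dummy marker would then form the full model for the next 100 trials.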

To illustrate the pruning process, consider the table above. The table was obtained using the detection panel data. The shaded entries indicate those markers that are retained after pruning. Another 100 trials are performed using the following full model:

-   my.model <- Class ~ X6+X7+X8+X12+X16+X18+X23+X25

Again, a frequency table, Table 23, is constructed:

TABLE 23

The shaded entries show the markers retained after pruning (using a cut-off of 47). Another 100 trials are performed using the following full model:

-   my.model <- Class ~ X6+X7+X8+X12+X18+X23+X25

Again, a frequency table, Table 24, is constructed:

TABLE 24

At this point a cut-off of 50 is chosen. The shaded entries show the remaining markers for use on a 5-member panel. In each step, the average AUC increases: 94.37%→95.45%→95.78%.

(iii) Assessing the Performance of the Panel

To assess the performance of the panel, 100 trials were performed, asbefore, but without the stepwise selection procedure. For each trial,the AUC, sensitivity, and specificity are recorded. For the detectionpanel example above, the results are:

-   >my-model <-Class˜X7+X25+X6+X23+X12-   >summary(AUC)    -   Min. 1st Qu. Median Mean 3rd Qu. Max.    -   0.9289 0.9590 0.9615 0.9601 0.9630 0.9630-   >summary(sensitivity)    -   Min. 1st Qu. Median Mean 3rd Qu. Max.    -   0.8519 0.9630 0.9630 0.9737 1.0000 1.0000-   >summary(specificity)    -   Min. 1st Qu. Median Mean 3rd Qu. Max.    -   0.8378 0.9730 0.9730 0.9749 1.0000 1.0000

In summary, averaged over the 100 trials, the panel has a sensitivity of 97.37% and a specificity of 97.49%. The area under the ROC is 96.01%.

(d) Method 2: Using SAS (Version 8.2)

Logistic regression can be performed in SAS using the procedure LOGISTIC. When the response variable is a two-level factor, the procedure fits a binary logit model (equivalent to glm in R with family=binomial and link=logit). SAS automatically excludes all of the missing multivariate observations for the model specified. Unlike R, SAS is able to perform a best subsets variable selection procedure. The code fragment in SAS needed to do this is as follows:

    PROC LOGISTIC DATA=WORK.panel;
      CLASS Class;
      MODEL Class = X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
                    X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26
                    X27 X28 / SELECTION=SCORE BEST=28;
    RUN;

This procedure is applied to the entire data set. The parameter BEST=28 directs SAS to find the best 28 single-variable models, the best 28 two-variable models, the best 28 three-variable models, and so on up to the best 28 28-variable models.

(i) Assessing the Performance of the Panels

The procedure described in method 1 is used to assess the performance of each of the panels. The following, Table 25, was generated from the detection panel data. It lists results only for the two best one-, two-, three-, four-, and five-marker panels.

TABLE 25

Panel | Panel members | Sensitivity | Specificity | Area under ROC
1 | 7 | 94.28%
2 | 28 | 80.14%
3 | 7, 16 | 95.00%
4 | 7, 15 | 94.59%
5 | 7, 15, 16 | 95.94%
6 | 1, 7, 16 | 95.33%
7 | 1, 7, 15, 16 | 95.61%
8 | 4, 7, 15, 16 | 95.34%
9 | 1, 4, 7, 15, 16 | 95.30%
10 | 1, 7, 11, 15, 16 | 95.57%

(2) Linear Discriminant Analysis

(a) Background

The commercial statistical package SPSS has procedures allowing simple linear discriminant functions to be designed and tested.

A commonly used method is Fisher's linear discriminant function. This finds the hyperplane in feature space which gives a good separation of classes. For a two-class problem where the class distributions have different means, but similar multivariate Gaussian distributions, this classifier gives optimum performance. The method can be extended heuristically to multi-class problems, but this was not applied in the study.

The method is simplistic in its approach but robust to problems associated with data sets containing a large number of features (the probes in our case number 27, which poses a problem for a data set comprising only some two hundred exemplars (cases)).

This package has a procedure for identifying the features which contribute well to the discrimination process. This “stepwise method” first finds the most discriminating feature. Other features are then sequentially added and evaluated against the classifier. Combinations are explored so the final solution may exclude features initially selected if better combinations are found. The number of features is gradually increased until a statistical test shows the remaining features do not contribute reliably to the classification process.

An estimate of the performance is gained by using the leave-one-out method. This removes one sample from the data set to form the training set. The left-out sample is retained as the test set, applied to the classifier, and the resulting classification accumulated in the confusion matrix. The procedure is repeated for each case in the data. This procedure gives an unbiased estimate of performance, but the estimate will have a high variance.
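The leave-one-out loop can be sketched as below. The classifier used here is a nearest-class-mean rule, which is an assumption made purely to keep the example self-contained; it stands in for the Fisher linear discriminant function fitted by SPSS and is not the method used in the study. All function names and data are hypothetical.

```python
from collections import defaultdict

def nearest_mean_fit(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_mean_predict(means, row):
    """Assign the class whose mean vector is closest to the row."""
    def sqdist(mean):
        return sum((a - b) ** 2 for a, b in zip(row, mean))
    return min(means, key=lambda c: sqdist(means[c]))

def leave_one_out(X, y):
    """Hold out each case once; return {(true, predicted): count}."""
    confusion = defaultdict(int)
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # every case except case i
        train_y = y[:i] + y[i + 1:]
        model = nearest_mean_fit(train_X, train_y)
        confusion[(y[i], nearest_mean_predict(model, X[i]))] += 1
    return dict(confusion)
```

Each case contributes exactly one entry to the confusion matrix, so the counts always sum to the number of cases.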

Method

In SPSS select the appropriate data set for analysis, select “Analyze”, select “Classify”, select “Discriminant . . . ”, on the table select “Fishers method”, “leave one out testing” and “use stepwise method”. Enter the diagnosis as the grouping variable and enter all the features as the independents. Enter “OK” to complete the analysis. Pre-set values for other parameters were left as set.

The analysis output includes a list of the features used in the analysis, the canonical discriminant function, a confusion matrix, and the correct-classification rate (1 − error rate).

In order to compute an ROC curve, the canonical discriminant function is applied to the selected features to generate a new feature. In SPSS use Graphs, ROC to plot this curve.

ii. Hierarchical Methods: Decision Trees

(1) Background

Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method for classification that is robust to noisy data and capable of learning disjunctive expressions (Tom M. Mitchell, “Machine Learning”, McGraw-Hill, New York, N.Y., 1997).

The most popular and accessible machine learning package is “C4.5”, the source code of which is published in J. Ross Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann, San Mateo, Calif., 1993.

When a decision tree is being trained (on training data), the algorithm decides at each node of the tree which single attribute of the data to use at this node to best make a decision. Therefore, when the tree is completely constructed, it will have selected some set of attributes to use and ignored others. In our application, using decision trees to process measurements gained from molecular probes, the decision tree has effectively chosen a panel of probes, and a method of combining the probe scores, which best explains the classification of the data. To obtain an unbiased estimate of the panel performance, the resulting tree must be evaluated on data which was not used in the training. One standard technique for doing this is cross-validation. A 10-fold cross-validation was employed.

Cross-validation is a technique for making the very best use of limited data. In 10-fold cross-validation the data is randomly split into 10 nearly equal-sized partitions, taking care to have approximately the same number of cases of each class in every partition. Then, the decision tree is trained on partitions 2-10 combined and tested on partition 1, then trained on partitions 1 and 3-10 combined and tested on partition 2, and so on for 10 trials, rotating the held-out test set through the data once. In this manner tests are only ever performed on held-out data and so are unbiased, and all data is tested exactly once so an aggregate error rate across the whole data set can be computed.
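The partitioning step can be sketched as follows. This is a minimal illustration and not the “xval.sh” script supplied with C4.5; the function names, the fixed random seed and the round-robin dealing of cases are assumptions chosen to keep the class counts balanced across partitions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign every case index to one of k folds, balancing classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)              # random order within each class
        for index in members:             # deal cases out round-robin
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out each fold once."""
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

Because each fold is held out exactly once, every case is tested exactly once and never appears in the training set of its own trial.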

Trees are usually constructed until they are a very good fit to the training data, then they are “pruned” back by clipping off “noisy” branches and leaves. This improves the generalization ability of the decision tree on unseen data and is essential to obtain good performance. The C4.5 package includes two methods for pruning trees: first, a standard tree pruning algorithm; second, a rule extraction algorithm. In general, the tree-based method was found to give superior results on this data. Therefore, the rule-based method is not reported.

(2) Data Preparation

Data on the response of various probes to normal tissue and five different cancers (Adenocarcinoma, Large Cell Carcinoma, Mesothelioma, Small Cell Lung Cancer, and Squamous Cell Carcinoma) was obtained as described elsewhere. The H-scores for probes 1-28 from Pathologist 1 and Pathologist 2 were extracted from the database and put into a flat data file. For the decision tree analysis each data point (even by two pathologists on the same physical slide) was taken to be an independent observation of the effect of disease on staining. This may slightly positively bias the performance of classification but should have no effect on panel selection.

-   The control categories of Emphysema, Granulomatous Disease, and Interstitial Lung Disease were grouped together and called “Normal”.
-   For the detection panel all the cancers were grouped together and called “Abnormal”, making this a 2-class problem.
-   For the single discrimination panel, the Normal cases were removed from the data to form a 5-class problem.
-   For the hold-out discrimination panels, each cancer was held out in turn and the remaining cancers grouped into “Other” to give a set of five 2-class problems.
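The three relabelings listed above can be sketched as follows. The function names and the representation of the data as a list of diagnosis strings are assumptions made for illustration; only the class names come from the text.

```python
CANCERS = [
    "Adenocarcinoma",
    "Large Cell Carcinoma",
    "Mesothelioma",
    "Small Cell Lung Cancer",
    "Squamous Cell Carcinoma",
]

def detection_labels(diagnoses):
    """2-class problem: every cancer is 'Abnormal', everything else 'Normal'."""
    return ["Abnormal" if d in CANCERS else "Normal" for d in diagnoses]

def discrimination_cases(diagnoses):
    """5-class problem: keep (case index, diagnosis) for cancer cases only."""
    return [(i, d) for i, d in enumerate(diagnoses) if d in CANCERS]

def hold_out_cases(diagnoses, target):
    """2-class problem: the target cancer against the pooled 'Other' cancers."""
    return [
        (i, d if d == target else "Other")
        for i, d in discrimination_cases(diagnoses)
    ]
```

Calling hold_out_cases once per entry of CANCERS gives the five 2-class problems; the case indices identify which rows of probe scores belong to each problem.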

C4.5 requires a “.names” file which describes the data and the attributes to be included in the analysis. An example names file for the discrimination panel is shown in Table 26:

TABLE 26

    |
    | C4.5 Names file for MonoGen ZF21 diag data
    |
    Adenocarcinoma, Large Cell Carcinoma, Mesothelioma, Small Cell Lung Cancer, Squamous Cell Carcinoma. | classes

    P1: continuous.
    P2: continuous.
    P3: continuous.
    P4: continuous.
    P5: continuous.
    P7: continuous.
    P8: continuous.
    P9: continuous.
    P10: continuous.
    P11: continuous.
    P12: continuous.
    P13: continuous.
    P14: continuous.
    P15: continuous.
    P16: continuous.
    P17: continuous.
    P18: ignore.
    P19: continuous.
    P20: continuous.
    P21: continuous.
    P22: continuous.
    P23: continuous.
    P24: continuous.
    P25: continuous.
    P26: continuous.
    P27: continuous.
    P28: continuous.

Probe 18 was missing from the data and was set to “ignore” in all the designs. Setting attributes to “ignore” in the names file is an easy and effective way of trimming probes from the panels and is used in the data analysis.

(3) Data Analysis

Ten-fold cross validation was run on each data set using the “xval.sh” script supplied with C4.5. Standard (default) parameters for the package were used. Cross validation is a technique developed for classifier training and testing on small data sets. It involves randomly splitting the data into N equal-sized partitions. The classifier is then trained on N−1 partitions together and tested on the remaining partition. This is repeated N times.

Since the decision tree trained in one cross-validation (CV) trial may differ from the tree obtained in another (different in both probes selected and tree coefficients), the number of times each probe was selected by the tree in 10 trials was computed.

The first cull of probes was done by setting to ignore any probe which did not occur in a pruned tree 5 or more times out of the 10 CV trials.

Then the cross-validation was repeated with this smaller set of candidate probes. The second cull of probes was done by setting to ignore any probe which did not occur in a pruned tree 5 or more times out of the 10 CV trials. If any further probes dropped out, a third CV run was done.

The panels selected by the various runs and their estimated error performance are shown in the results tables. The panel performance for decision tree analysis is shown below, in Table 27.

TABLE 27 Panel Performance - Decision Trees

Detection Panel (Probes: 3, 7, 19, 25 and 28)
                Cancer     Control
Cancer          99.42%      0.58%
Control         17.82%     82.18%

Pair-wise Discrimination (Probes: 4, 6, 14, 19 and 23)
                Adeno      Others
Adeno           67.74%     32.26%
Others          11.20%     88.80%

Pair-wise Discrimination (Probes: 3, 6, 17, 19 and 25)
                Squamous   Others
Squamous        70.59%     29.41%
Others           4.07%     95.93%

Pair-wise Discrimination (Probes: 1, 5, 10, 13, 21, 27 and 28)
                Large Cell Others
Large Cell      36.36%     63.64%
Others           7.37%     92.63%

Pair-wise Discrimination (Probes: 3, 12 and 16)
                Mesothelioma Others
Mesothelioma    82.05%     17.95%
Others           5.00%     95.00%

Pair-wise Discrimination (Probes: 12, 17, 20, 23 and 25)
                Small Cell Others
Small Cell      69.23%     30.77%
Others           1.49%     98.51%

Detection (without probe 7) (Probes: 6, 10, 16 and 19)
                Cancer     Control
Cancer          89.60%     10.40%
Control          3.30%     96.70%

Detection (only commercially preferred probes) (Probes: 5, 6, 10, 16, 19 and 23)
                Cancer     Control
Cancer          92.80%      7.20%
Control          5.49%     94.51%

An example decision tree structure is shown below, in Tables 28 and 29, for discriminating between Small Cell Lung Cancer and the remaining four types of cancer.

C4.5 Output Format:

TABLE 28

    P23 <= 3:
    |   P25 <= 2: Small Cell Lung Cancer (18.0)
    |   P25 > 2:
    |   |   P17 <= 5: Small Cell Lung Cancer (2.0)
    |   |   P17 > 5:
    |   |   |   P20 <= 11: Other (9.0)
    |   |   |   P20 > 11: Small Cell Lung Cancer (2.0)
    P23 > 3:
    |   P12 > 7: Other (120.0)
    |   P12 <= 7:
    |   |   P20 <= 2: Other (5.0)
    |   |   P20 > 2: Small Cell Lung Cancer (4.0)

    Tree saved

    Evaluation on training data (160 items):

        Before Pruning        After Pruning
        Size   Errors         Size   Errors     Estimate
        13     0(0.0%)        13     0(0.0%)    (5.2%)   <<

TABLE 29 Pictorial format:

The panel performance for stepwise linear discriminant is shown below, in Table 30:

TABLE 30 Panel Performance - Stepwise LD

Detection Panel (Probes: 1, 4, 7, 15 and 16)
                Cancer     Control
Cancer          92.24%      7.76%
Control          1.16%     98.84%

Pair-wise Discrimination (Probes: 4, 5, 14, 19, 20, 25 and 27)
                Adeno      Others
Adeno           91.67%      8.33%
Others           5.43%     94.57%

Pair-wise Discrimination (Probes: 1, 2, 3, 24, 25 and 26)
                Squamous   Others
Squamous        88.00%     12.00%
Others           6.59%     93.41%

Pair-wise Discrimination (Probes: 1 and 7)
                Large Cell Others
Large Cell      80.95%     19.05%
Others          26.32%     73.68%

Pair-wise Discrimination (Probes: 3, 12 and 16)
                Mesothelioma Others
Mesothelioma    96.67%      3.33%
Others           4.65%     95.35%

Pair-wise Discrimination (Probes: 12, 19, 22 and 23)
                Small Cell Others
Small Cell      93.75%      6.25%
Others           5.00%     95.00%

Detection (without probe 7) (Probes: 1, 2, 3, 4, 10, 11, 15, 16, 23, 24, 27 and 28)
                Cancer     Control
Cancer          85.34%     14.66%
Control          2.33%     97.67%

Detection (only commercially preferred probes) (Probes: 8, 10, 11, 19, 23 and 28)
                Cancer     Control
Cancer          81.20%     18.80%
Control          1.16%     98.84%

The panel performance for stepwise logistic regression analysis is shown below, in Table 31:

TABLE 31 Panel Performance - Stepwise LR

Detection Panel (Probes: 6, 7, 12, 23 and 24)
                Cancer     Control
Cancer          97.49%      2.63%
Control          2.51%     97.49%

Pair-wise Discrimination (Probes: 14, 19, 20, 25 and 27)
                Adeno      Others
Adeno           96.39%      3.61%
Others          12.29%     87.71%

Pair-wise Discrimination (Probes: 3 and 10)
                Squamous   Others
Squamous        94.93%      5.07%
Others          35.86%     64.14%

Pair-wise Discrimination (Probes: 1, 4, 6, 16 and 21)
                Large Cell Others
Large Cell      95.11%      4.89%
Others          61.00%     39.00%

Pair-wise Discrimination (Probes: 3, 7, 12 and 16)
                Mesothelioma Others
Mesothelioma    95.07%      4.93%
Others          10.89%     89.11%

Pair-wise Discrimination (Probes: 12, 13 and 23)
                Small Cell Others
Small Cell      98.90%      1.10%
Others           4.00%     96.00%

Detection (without probe 7) (Probes: 1, 10, 19, 23 and 28)
                Cancer     Control
Cancer          94.00%      6.00%
Control          5.80%     94.20%

Detection (only commercially preferred probes) (Probes: 10, 19, 20, 23 and 28)
                Cancer     Control
Cancer          93.88%      6.12%
Control          6.39%     93.61%

iii. Neural Networks and alternative methods

Artificial neural networks (ANNs) are candidate pattern recognition techniques which could readily be applied to select features and design classifiers in association with this invention. However, such techniques give little insight into the structure of the data and the influence of particular probes in the way that LDF gives. For this reason this class of algorithm was not used in this study. LDF stands for linear discriminant function, a linear combination of features whose result is thresholded to determine the classification.

This class of techniques includes algorithms such as Multi-Layer Perceptron (MLP), Back-Prop, Kohonen's Self-Organizing Maps, Learning Vector Quantization, K-nearest neighbors and Genetic Algorithms.

iv. Special Topics

(1) Assumptions

-   Linear discriminant analysis
    -   Assumes the covariance matrices for the two classes are equal.
    -   Minimizes the cost of misclassification only when the two classes are multivariate normal.
    -   Assumes that the explanatory variables are continuous rather than categorical (in this study, the H-scores are categorical, while in practice (i.e., in an automated system) intensity can be measured on a continuous scale).
-   Logistic regression (binomial generalized linear models)

See Venables and Ripley, chapter 7 (“Modern Applied Statistics with S-PLUS”, W. N. Venables and B. D. Ripley, Springer-Verlag, New York, 1999).

(2) Marker Rejection (De-Selection)

Computerized implementations of discriminant analysis and regression procedures include stepwise variable selection procedures; e.g., stepAIC in R. These procedures are designed to select the best subset of variables for use as explanatory variables. In reality, because of the step-by-step nature of these procedures, there is no guarantee that the best variables are selected for prediction (Johnson and Wichern, p. 299). Nevertheless, such procedures do provide the basis for marker selection and de-selection.

(3) Pairwise Tests

Inherent problems in designing multiclass classifiers are discussed in “Applied Multivariate Statistical Analysis”, R. A. Johnson and D. W. Wichern, 2nd Ed, 1988, Prentice-Hall, N.J. This is the motivation for developing several separate two-class classifiers (discrimination panels).

(4) Redundancy Consideration in Panel Composition

“Linear models form the core of classical statistics and are still the basis of much of statistical practice” (“Modern Applied Statistics with S-PLUS”, W. N. Venables and B. D. Ripley, Springer-Verlag, New York, 1999). Linear models are the foundation for the t-test, analysis of variance (ANOVA), regression analysis, as well as a variety of multivariate methods including discriminant analysis. Explanatory variables may or may not enter the model as first-order terms. This is true also of (non-linear) logistic regression. The logistic regression model is simply a non-linear transformation of the linear regression model: the dependent variable is replaced by a log odds ratio (logit). In summary, these statistical methods are based on linear relationships between the explanatory variables. Consequently, one avenue for seeking redundancy in panels is to identify highly correlated variables (markers). It may be possible to replace one marker with the other in a panel to achieve similar performance.
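A search for highly correlated marker pairs can be sketched as below. The Pearson correlation coefficient and the threshold of 0.9 are assumptions made for illustration; the text does not state which correlation measure or cut-off was used, and the marker names and scores in the usage note are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(scores, threshold=0.9):
    """Return (marker_a, marker_b, r) for every pair with |r| >= threshold."""
    names = sorted(scores)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(scores[a], scores[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs
```

A pair returned by correlated_pairs is only a candidate for substitution; whether swapping one marker for the other preserves panel performance still has to be checked by refitting and re-assessing the panel.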

Another avenue for seeking redundancy in panels is to undertake a “best subsets” regression analysis. Given a starting model with all of the explanatory variables of interest, the aim is to find the best single-variable regression models, the best two-variable regression models, etc. This methodology is implemented in the SAS statistical package.

(5) Use of Weighting Scores

(a) Commercial and Clinical Considerations

For many reasons, including strategic and commercial factors, cost, availability, and ease of use, it may be preferred to encourage the selection of certain probes in a panel and penalize the selection of others, at the same time trading this off against panel size or performance.

(b) Attribute Costing

Methods for such attribute weighting (in decision trees) have been proposed in the machine learning literature in other contexts, such as the incorporation of background knowledge (M. Nunez, “The Use of Background Knowledge”, Machine Learning 6: 231-250, 1991) and the differential cost of obtaining information from robotic sensors (M. Tan, “Cost-sensitive Learning of Classification Knowledge and its Applications in Robotics”, Machine Learning 13: 7-33, 1993).

Both of these cost-sensitive algorithms have been implemented in the literature by minor changes to the standard machine learning software package known as “C4.5” (J. Ross Quinlan, “C4.5: programs for machine learning”, Morgan Kaufmann, CA, 1993). For convenience, this approach was followed to implement the “EG2” algorithm of Nunez.

In the C4.5 decision tree construction phase, the algorithm compares each available attribute to split on and chooses the single one which maximizes the information gain, Gi. In the EG2 algorithm, (2^(Gi) − 1)/(Ci + 1) is maximized, which incorporates the cost of information for attribute i, Ci. The vector of weights needs to be set a priori by the user.
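The effect of the EG2 criterion on attribute choice can be sketched as follows. The function names and the gain and cost values in the usage note are hypothetical; only the expression (2^(Gi) − 1)/(Ci + 1) comes from the text.

```python
def eg2_score(gain, cost):
    """EG2 criterion: (2**G - 1) / (C + 1) for information gain G and cost C."""
    return (2.0 ** gain - 1.0) / (cost + 1.0)

def choose_attribute(gains, costs):
    """Pick the attribute with the largest EG2 score.

    gains maps attribute name to information gain; costs maps attribute
    name to its penalty. Attributes missing from costs are treated as free.
    """
    return max(gains, key=lambda a: eg2_score(gains[a], costs.get(a, 0.0)))
```

For example, with gains {"P7": 0.60, "P10": 0.50} and no costs, P7 is chosen; with a penalty of two on P7 its score falls to about 0.17 against about 0.41 for P10, so P10 is chosen instead. With every cost at zero the ranking is the same as ranking by information gain, since 2^G − 1 increases with G.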

(i) Code Modifications

The C4.5 source code was modified to implement the economic generalizer “EG2” algorithm proposed by M. Nunez (“The Use of Background Knowledge”, Machine Learning 6: 231-250, 1991).

The exact modifications to the C4.5 package are as follows.

After the following lines in file “R8/Src/contin.c” (J. Ross Quinlan, “C4.5: programs for machine learning”, Morgan Kaufmann, CA, 1993):

    ForEach(i, Xp, Lp - 1)
    {
        if ( (Val = SplitGain[i] - ThreshCost) > BestVal )
        {
            BestI = i;
            BestVal = Val;
        }
    }

the new line

    BestVal = (powf(2.0, BestVal) - 1.0) / (AttributeCosts[Att] + 1.0);

is inserted, where the vector of attribute costs has been previously read in from a text file maintained by the user.

(ii) Experimental Methodology.

The commercially preferred probes are: 2, 4, 5, 6, 8, 10, 11, 12, 16, 19, 20, 22, 23, 28.

For the sake of example, suppose the above probes are commercially preferred due to cost and it is desired to reselect the detection panel taking this cost into account.

The modified C4.5 decision tree software was used to give the commercially preferred probes a penalty of zero and non-commercially preferred probes a penalty of two. The 10-fold cross-validated panel selection methodology (as described elsewhere) was run using the modified C4.5 algorithm.

(iii) Results

The standard decision tree detection panel consists of probes 3, 7, 19, 25, 28. The resulting panel members are 2, 6, 7, 10, 19, 25, 28, which include only 2 probes that are not commercially preferred, P7 and P25. Note these probes have been selected by the method in spite of their increased cost due to their superior performance on this data. The panel is now larger: 7 probes versus 5 originally. There is no demonstrable drop in panel performance on this data, although the performance will now be sub-optimal as a trade-off against the reduced cost of probes.

(iv) Conclusion

A straightforward way has been established for incorporating costs of using probes into the panel selection methodology.

(c) Misclassification Costing

(i) Background

For many reasons it may be desired to select an optimal panel bearing in mind that the costs of the different kinds of classification errors may vary. For example, it may be desired to select a panel which has an increased sensitivity to one disease (say Large Cell Carcinoma) and be willing to trade this off against reduced specificity and sensitivity elsewhere in the confusion matrix.

In theory, a matrix of misclassification costs (of the same dimensions as the confusion matrix) to incorporate all the possible combinations of costs may be needed. In practice, only those costs which are non-unity (unity being the default) are entered.

The commercial decision tree software See5 (RuleQuest Research Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia (http://www.rulequest.com)) incorporates this capability and was used in the following demonstration.

(ii) Aim

The standard joint discrimination panel (described elsewhere) consists of the members: P2, 3, 4, 5, 12, 14, 16, 19, 22, 23, 28, and gives the following estimated confusion matrix:

    (a) (b) (c) (d) (e) <-classified as
    24 4 2 5 2 (a): class Adenocarcinoma
    8 7 3 5 4 (b): class Large Cell Carcinoma
    1 1 33 1 4 (c): class Mesothelioma
    6 2 1 23 (d): class Small Cell Lung Cancer
    4 4 3 2 24 (e): class Squamous Cell Carcinoma

The sensitivity of Large Cell Carcinoma is low at 26 percent. If one wished to increase this sensitivity in a newly designed panel, the following method may be employed.

(iii) Methodology

The following costs file was generated:

    | costs file for ZF21 Discrim
    |
    | Increase sensitivity for “Large Cell Carcinoma”
    |
    Mesothelioma, Large Cell Carcinoma: 10
    Adenocarcinoma, Large Cell Carcinoma: 10
    Mesothelioma, Large Cell Carcinoma: 10
    Small Cell Lung Cancer, Large Cell Carcinoma: 10
    Squamous Cell Carcinoma, Large Cell Carcinoma: 10

This file upweights the misclassification of Large Cell Carcinoma as any of the other cancers by a factor of 10. This will tend to increase the sensitivity of detection in this class (with reduced performance elsewhere), but no weighting can ensure perfect classification.

The standard decision tree panel selection methodology was applied (using See5 instead of C4.5).

(iv) Results

The new panel members are: P2, 3, 4, 5, 6, 9, 12, 14, 16, 17, 25, 28, with an estimated performance of:

    (a) (b) (c) (d) (e) <-classified as
    20 13 1 1 2 (a): class Adenocarcinoma
    3 13 3 2 6 (b): class Large Cell Carcinoma
    1 9 27 2 1 (c): class Mesothelioma
    2 9 21 (d): class Small Cell Lung Cancer
    1 15 2 1 18 (e): class Squamous Cell Carcinoma

The above demonstrates that the estimated sensitivity of Large Cell Carcinoma has now increased to 48%.
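The sensitivity figures quoted for Large Cell Carcinoma can be checked from the (b) rows of the two confusion matrices. The sketch below assumes that each row holds the cases of one true class and each column the class assigned by the panel; the function name is hypothetical.

```python
def class_sensitivity(row, class_index):
    """Fraction of one class's cases that were assigned to that class."""
    return row[class_index] / sum(row)

# Row (b), Large Cell Carcinoma, before and after misclassification costing.
large_cell_standard = [8, 7, 3, 5, 4]
large_cell_costed = [3, 13, 3, 2, 6]

before = class_sensitivity(large_cell_standard, 1)  # 7 of 27 cases
after = class_sensitivity(large_cell_costed, 1)     # 13 of 27 cases
```

The two ratios, 7/27 and 13/27, round to the 26 percent and 48% quoted in the text.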

(v) Conclusion

A straightforward way has been demonstrated for incorporating the differential costs of misclassification into the panel selection methodology.

d. Performance Metrics

Outputs provided by the analysis indicating the estimated performance of each method include:

i. ROC Analyses

Receiver Operating Characteristic (ROC) curves show the estimated percentage (or per unit probability) of false positive and false negative scores for different threshold levels in the classifier. An indifferent classifier, unable to discriminate better than random choice, would present a ROC curve with equal true and false readings. The area under this curve would be 50% (0.5 probability).

Area Under the Curve (AUC) is often used as an overall estimate of classifier performance and most commercial discriminant function packages compute this figure. A perfect classifier would have 100% Area Under the Curve; a useless classifier would have an AUC near 50% (0.5).
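One way to compute the AUC directly from classifier scores is the pairwise (rank) formulation sketched below: the AUC equals the fraction of abnormal/normal pairs in which the abnormal case receives the higher score, with ties counted as one half. This is a standard identity and not necessarily the calculation performed by the packages used in the study; the function name and example scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve from the scores of the two classes."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0     # abnormal case correctly ranked higher
            elif p == n:
                wins += 0.5     # tie counts as half
    return wins / (len(positive_scores) * len(negative_scores))
```

Scores that separate the classes perfectly give 1.0, and scores that carry no information (for example, all equal) give 0.5, matching the two limiting cases described above.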

ii. Confusion matrices: counts and percentages

Confusion matrices show how data from the test set was classified. For pair-wise tests these are counts of true positive, false positive, true negative or false negative scores. These may be shown as actual counts or as percentages. For the multi-way panel, which attempts to give a unique diagnosis with one panel only, the confusion matrix would show counts for each correct classification. For instance, each time small cell carcinoma is detected as such it would be entered in one diagonal element of the matrix. Incorrect scores, for instance how often a small cell carcinoma is incorrectly identified as squamous cell cancer, would be entered in the appropriate off-diagonal element of the matrix. Error rates are used to summarize data in the confusion matrix as the sum of all false classifications divided by the total number of classifications made, expressed as a percentage.

iii. Sensitivity and specificity

Specificity refers to the extent to which any definition excludes invalid cases. If a definition has poor specificity, it is high in false positives. This means that it labels individuals as having a disorder when there is really no disorder present. Sensitivity refers to the extent to which any definition includes all valid cases. If a definition has poor sensitivity, it is high in false negatives (individuals who have a disorder present are falsely being diagnosed as not having one).
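For a pair-wise test, the three summary figures follow directly from the four counts in the confusion matrix, using the definitions given in the Background (sensitivity = TP/(TP+FN), specificity = TN/(FP+TN)) and the error rate defined above. The function name and the counts in the usage note are hypothetical.

```python
def summarize(tp, fn, fp, tn):
    """Sensitivity, specificity and error rate from 2x2 confusion counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),    # diseased cases correctly detected
        "specificity": tn / (tn + fp),    # disease-free cases correctly cleared
        "error_rate": (fp + fn) / total,  # all false classifications
    }
```

For example, 90 true positives, 10 false negatives, 5 false positives and 95 true negatives give a sensitivity of 90%, a specificity of 95% and an error rate of 7.5%.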

3. Data Analysis and Results

a. Sample Size and Variability

-   Of the 354 cases in the combined Pathologist 1 and Pathologist 2 data set, only 202 cases possessed an H score for every marker (variable or feature).
-   The small number of complete observations and the large number of variables leads to estimation problems (curse of dimensionality). Hence it is necessary to prune back severely the number of variables used to build a classifier.
-   Due to the small number of observations it is not prudent to divide the data into separate training and testing sets (necessary for the robust estimation of classifier performance). For this reason, it was necessary to use resampling methods (such as cross-validation and multiple random trials).
-   The design of a multiclass classifier for cancer discrimination is difficult because there are so few observations for each type of cancer.

b. De-Selected Markers

Markers were de-selected using the methodology described above. Markers that were de-selected are represented by non-selection in the panels.

c. Detection Panel(s) Composition

i. Selected Marker Probes

The selected marker probes for all three methods are summarized in FIG. 5.

ii. Minimum Selected Marker Set

For the detection panel it is clear that probe 7 delivered the best detection performance for a single marker. Combinations of probes were analyzed to see if a reliable panel could be obtained with more probes.

(1) Method

The Logistic Regression method allows best subsets to be ranked in terms of a performance measure (Fisher score). This analysis was used to select the combinations from 1 through 5 probes. Fisher's linear discriminant function and logit models (logistic regression) were used to illustrate the performance of these combinations. The data are shown above.

(2) Conclusions

Probe 7 performs well on its own as a classifier; however, a drawback to using probe 7 alone is that probe 7 has a high false negative score. The best performance using Fisher's linear discriminant function as a classifier was with probes 7 and 16. The variability of results amongst panels using other combinations suggests the noise added by more features is outweighing any potential to improve classification scores. The small number of incorrectly scored samples gives a poor representation of the statistics of these rarer events. A larger number of cases may allow a better classifier to be designed. Techniques to select best combinations of probes using different classifiers may produce a different best panel, depending on the structure of the data.

iii. Supplemental Markers

It is shown that panels can be designed to suit the availability of different probes. Different methodologies can be used for selecting these subsets: Decision Trees, Logistic Regression, and Linear Discriminant Functions. Data are shown above.

Method

Using SPSS, a Fisher's linear discriminant function was applied to the scores obtained from a panel whose composition was restricted by access constraints, for example, that all of the probes come from one vendor. Again, the stepwise option was selected to find the best combination of features. Performance was estimated using the leave-one-out cross validation test.

iv. Alternative Markers: Biological Mechanisms of Action (Functionally Equivalent Markers)

A person of ordinary skill in the art is able to determine functionally equivalent markers. The functional behaviors of the markers used in the panel are described throughout this document.

v. Marker Localization

The localizations of the various markers used in this study are described elsewhere in this document.

vi. Panel Performance

The performance of the three methods is shown above.

vii. Limitations on Interpretation of Panel Performance

-   Due to the small data set and the need to employ resampling methods, there is the danger that the classifiers have been over-trained (made to fit the data too closely).
-   The panel performance using cytology specimens is difficult to forecast accurately since it is not clear whether sputum cytology samples will contain adequate numbers of cells that are representative of the cells analyzed in the histological validation studies. Nevertheless, given an adequate cellular sample size, one would expect the optimized panel to behave similarly with cytological specimens.

d. Discriminant Panel Composition

i. A Single 5-Way Panel for all Cancers

Of the three analysis techniques, only a decision tree is amenable to a single 5-way panel. A single decision tree was therefore constructed to simultaneously classify all types of lung cancer. The panel members are shown in FIG. 5. The panel performance is shown above in the panel performance tables.

ii. Panels for Discriminating a Single Type of Lung Cancer Against all Others

Linear discriminant functions are not well suited to performing simultaneous multi-class discrimination. The performance of five separate classifiers, each designed separately to discriminate one of the cancers from a pooled set of all the cancers, was analyzed. Such combinations have the potential to classify none of the cases as having one of the candidate cancers, or classify a single case as having two or more of the candidate cancers. This has a potential advantage in identifying inconsistent cases for further review.

It has been seen that a single discriminant panel for all cancer types (a five-way classifier) has a fairly high overall error rate. In the panel performance data shown above, the performance of five pair-wise classifiers, each designed to identify one cancer from the four other possible cancers, is shown. This approach is amenable to analysis using Decision Trees and Linear Discriminant functions. The technique has the potential to deliver an ambiguous finding when applied, giving two or more diagnoses for a single patient, suggesting further clinical investigation. The technique has the potential to deliver no finding, again suggesting further investigation (perhaps a re-test with the detection panel).
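The way the five pair-wise classifiers combine into a single finding can be sketched as follows. The function name, the result labels and the dictionary of classifier outputs are assumptions made for illustration.

```python
def combine_findings(positive_by_cancer):
    """Combine the outputs of the five pair-wise classifiers for one case.

    positive_by_cancer maps each cancer type to True when that cancer's
    classifier called the case positive, and to False otherwise.
    """
    hits = sorted(c for c, positive in positive_by_cancer.items() if positive)
    if not hits:
        return "no finding", hits          # suggests a re-test or review
    if len(hits) == 1:
        return "single diagnosis", hits
    return "ambiguous", hits               # two or more candidate diagnoses
```

Both the "no finding" and the "ambiguous" outcomes flag the case for further investigation, as described above.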

iii. Panels to Account for Possibility of False Positive Cases from Detection Panels

A further panel can be trained to discriminate among the false positive cases (from the detection panel) and the five cancer types. This involves selecting those individual cases from the detection panel that were incorrectly classified as abnormal. This trains a dedicated classifier on the ‘harder’ problem of detecting these ‘special’ cases. However, while this is a theoretically sound task, the data set only yielded four of these cases and the population was deemed to be under-represented for analysis.

iv. Selected Markers

The selected marker probes for all three methods are summarized in FIG. 5.

v. Minimum Selected Marker Set

This topic is addressed below under “Robustness of Approach Demonstrated by Similar Results Using Different Methods.”

vi. Supplemental Markers

This topic is addressed below under “Robustness of Approach Demonstrated by Similar Results Using Different Methods.”

vii. Alternative Markers: Biological Mechanisms of Action

A person of ordinary skill in the art is able to determine functionally equivalent markers. The functional behaviors of the markers used in the panel are described throughout this document.

viii. Marker Localization

The localization of the various markers used in this study is described throughout this document.

ix. Panel Performance

The performance of the three methods is summarized in FIG. 5.

e. Effect of Weighting Parameters

In addition to user-supplied weighting criteria for markers and also for disease states (classes) as discussed earlier, one can also use a binary weighting scheme. For example, if all non-DAKO supplied probes are weighted “0” and all DAKO-supplied probes are weighted “1”, then the optimized panel will contain only DAKO-supplied probes. This is an important product design capability for any vendor who intends to develop and market molecular diagnostic panel kits using only their supplies.
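The effect of a binary weighting scheme on the pool of candidate probes can be sketched as below. The probe names and merit values are purely hypothetical, and multiplying a per-probe merit by its weight is an assumption made for illustration; the text does not specify how weights enter the selection procedure.

```python
from typing import Dict, List

def candidate_probes(merit: Dict[str, float],
                     weights: Dict[str, int]) -> List[str]:
    """Return probes that remain eligible, best weighted merit first.

    A weight of 0 removes a probe from consideration; a weight of 1
    leaves its merit unchanged. Probes with no weight are treated as 0.
    """
    weighted = {p: merit[p] * weights.get(p, 0) for p in merit}
    eligible = [p for p, score in weighted.items() if score > 0]
    return sorted(eligible, key=lambda p: -weighted[p])
```

With weights of 0 for every probe from other suppliers, only the preferred vendor's probes can appear in the optimized panel, whatever their individual merit.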

f. Effect of Using other (Non H-Score) Objective Scoring Parameters

i. Background

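The Pathology Review sheet contains a set of boxes as follows, in Table 32:

TABLE 32

| Percentage of cells | Intensity: None | Intensity: Weak | Intensity: Moderate | Intensity: Intense |
|---|---|---|---|---|
| 0-5% | 0 | 0 | 0 | 0 |
| 6-25% | 1 | 1 | 1 | 1 |
| 26-50% | 2 | 2 | 2 | 2 |
| 51-75% | 3 | 3 | 3 | 3 |
| >75% | 4 | 4 | 4 | 4 |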

The standard scoring system uses the “H score”, which is obtained by grading the intensity as none=0, weak=1, moderate=2, intense=3, and the percentage of cells as 0-5%=0, 6-25%=1, 26-50%=2, 51-75%=3, >75%=4, and then multiplying the two grades together. For example, 50% moderately stained plus 50% intensely stained would score 10 = 2×2 + 2×3.
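The H-score arithmetic can be sketched as follows, using the intensity and percentage grades exactly as listed. Summing the products over intensity levels is an assumption read from the form of the worked example, and the function names and the handling of percentages that fall between the listed bands (for example 5.5%) are illustrative choices rather than part of the scoring system.

```python
from typing import Dict

INTENSITY_GRADE = {"none": 0, "weak": 1, "moderate": 2, "intense": 3}

def percentage_grade(percent: float) -> int:
    """Map a percentage of cells to its grade (0-5%=0 ... >75%=4)."""
    if percent <= 5:
        return 0
    if percent <= 25:
        return 1
    if percent <= 50:
        return 2
    if percent <= 75:
        return 3
    return 4

def h_score(stained: Dict[str, float]) -> int:
    """Sum of (percentage grade x intensity grade) over intensity levels.

    `stained` maps an intensity level to the percentage of cells at that level.
    """
    return sum(percentage_grade(pct) * INTENSITY_GRADE[level]
               for level, pct in stained.items())
```

Under these grades, 50% moderate plus 50% intense staining gives 2×2 + 2×3 = 10, and the maximum, 100% intense staining, gives 4×3 = 12.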

ii. Method

An alternative scoring method was analyzed in which the response was divided into low, medium and high as follows:

-   (a) if more than 50% of cells had moderate or above stain: HIGH
-   (b) if more than 50% of cells had no stain: LOW
-   (c) otherwise: MEDIUM
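The three rules above can be written as a small function. This is a sketch only; the function and parameter names are hypothetical, and the rules are applied in the order (a), (b), (c) as listed.

```python
def three_level_score(pct_none: float, pct_weak: float,
                      pct_moderate: float, pct_intense: float) -> str:
    """Classify a staining response as LOW, MEDIUM or HIGH."""
    # (a) more than 50% of cells with moderate or stronger stain
    if pct_moderate + pct_intense > 50:
        return "HIGH"
    # (b) more than 50% of cells with no stain
    if pct_none > 50:
        return "LOW"
    # (c) everything else (pct_weak only matters through this fallback)
    return "MEDIUM"
```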

The decision tree detection panel selection methodology was repeated using this 3-level factor instead of H-score. This caused the tree to split into 3 branches at each node, if required.

iii. Results

The panel selected was: Probes 3, 7, 10, 11, 16, 19, 20, 28

With an estimated performance of:

| | Classified as (a) | Classified as (b) | |
|---|---|---|---|
| Control (a) | 79 | 22 | Specificity = 78% |
| Cancer (b) | 24 | 149 | Sensitivity = 86% |

This should be compared to the reference performance with H-scores of:

| | Classified as (a) | Classified as (b) | |
|---|---|---|---|
| Control (a) | 85 | 6 | Specificity = 93% |
| Cancer (b) | 5 | 120 | Sensitivity = 96% |
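The quoted sensitivity and specificity can be checked from the counts. The sketch below assumes the run-together figures are read as 79 and 22 for controls and 24 and 149 for cancers in the alternative scoring, and 85 and 6 for controls and 5 and 120 for cancers in the H-score reference; this reading is inferred from the stated percentages. The function names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True Positives / (True Positives + False Negatives)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """True Negatives / (False Positives + True Negatives)."""
    return true_neg / (false_pos + true_neg)

# Alternative 3-level scoring: controls 79 / 22, cancers 24 / 149.
alt_spec = round(100 * specificity(79, 22))    # whole percent
alt_sens = round(100 * sensitivity(149, 24))

# Reference H-score panel: controls 85 / 6, cancers 5 / 120.
ref_spec = round(100 * specificity(85, 6))
ref_sens = round(100 * sensitivity(120, 5))
```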

iv. Conclusions

-   -   There is a substantial loss of performance (larger panels, lower sensitivity and lower specificity) when the proposed alternative scoring system is used.
    -   Treating the H-score as a continuous variable (in the range 0 to 12) seems to be near optimal for panel selection on the data examined.
    -   The many other possible scoring systems have not been examined, but may be feasible and applicable to the experimentally tested panel design and development methodology.

4. Lung Cancer Detection and Discrimination Panels

Listed below are exemplary lung cancer detection and discrimination panels determined by the above illustrative example. It is noted that although the panels listed below recite specific probes, each specific probe may be substituted by a correlate probe or a functionally related probe.

Detection (No Constraints)

-   -   anti-Cyclin A combined with one or more additional probes
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31)
    -   anti-Cyclin A, anti-ER-related P29
    -   anti-Cyclin A, anti-mature surfactant apoprotein B
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31), anti-VEGF
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31), anti-mature surfactant apoprotein B
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-surfactant apoprotein A
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF, anti-surfactant apoprotein A
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF, anti-Cyclin D1
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31) combined with one or more additional probes
    -   anti-Cyclin A, anti-ER-related P29 combined with one or more additional probes
    -   anti-Cyclin A, anti-mature surfactant apoprotein B combined with one or more additional probes
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31), anti-VEGF combined with one or more additional probes
    -   anti-Cyclin A, anti-human epithelial related antigen (MOC-31), anti-mature surfactant apoprotein B combined with one or more additional probes
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF combined with one or more additional probes
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-surfactant apoprotein A combined with one or more additional probes
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF, anti-surfactant apoprotein A combined with one or more additional probes
    -   anti-Cyclin A, anti-mature surfactant apoprotein B, anti-human epithelial related antigen (MOC-31), anti-VEGF, anti-Cyclin D1 combined with one or more additional probes

Detection (W/O Anti-Cyclin A)

-   -   anti-Ki-67 combined with one or more additional probes
    -   anti-Ki-67 combined with any one probe selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67 combined with any two probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67 combined with any three probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67 combined with any four probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67 combined with any five probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67, anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B
    -   anti-Ki-67 combined with any one probe selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B, and with one or more additional probes
    -   anti-Ki-67 combined with any two probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B, and with one or more additional probes
    -   anti-Ki-67 combined with any three probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B, and with one or more additional probes
    -   anti-Ki-67 combined with any four probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B, and with one or more additional probes
    -   anti-Ki-67 combined with any five probes selected from the group consisting of anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen and anti-mature surfactant apoprotein B, and with one or more additional probes
    -   anti-Ki-67, anti-VEGF, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen, anti-mature surfactant apoprotein B and one or more additional probes

Detection With Commercially Preferred Probes

-   -   anti-Ki-67 combined with one or more additional probes
    -   anti-TTF-1 combined with one or more additional probes
    -   anti-EGFR combined with one or more additional probes
    -   anti-proliferating cell nuclear antigen combined with one or more additional probes
    -   two probes selected from the group consisting of anti-Ki-67, anti-TTF-1, anti-EGFR and anti-proliferating cell nuclear antigen
    -   three probes selected from the group consisting of anti-Ki-67, anti-TTF-1, anti-EGFR and anti-proliferating cell nuclear antigen
    -   anti-Ki-67, anti-TTF-1, anti-EGFR and anti-proliferating cell nuclear antigen
    -   two probes selected from the group consisting of anti-Ki-67, anti-TTF-1, anti-EGFR and anti-proliferating cell nuclear antigen, and one or more additional probes
    -   three probes selected from the group consisting of anti-Ki-67, anti-TTF-1, anti-EGFR and anti-proliferating cell nuclear antigen, and one or more additional probes
    -   anti-Ki-67, anti-TTF-1, anti-EGFR, anti-proliferating cell nuclear antigen, and one or more additional probes

Discrimination Between Adenocarcinoma and Other Lung Cancers

-   -   anti-mucin 1 and anti-TTF-1
    -   anti-mucin 1 and anti-TTF-1 combined with any one probe selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3
    -   anti-mucin 1 and anti-TTF-1 combined with any two probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3
    -   anti-mucin 1 and anti-TTF-1 combined with any three probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3
    -   anti-mucin 1 and anti-TTF-1 combined with any four probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3
    -   anti-VEGF, anti-surfactant apoprotein A, anti-mucin 1, anti-TTF-1, anti-BCL2, anti-ER-related P29 and anti-Glut 3
    -   anti-mucin 1, anti-TTF-1 and one or more additional probes
    -   anti-mucin 1 and anti-TTF-1 combined with any one probe selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3, and with one or more additional probes
    -   anti-mucin 1 and anti-TTF-1 combined with any two probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3, and with one or more additional probes
    -   anti-mucin 1 and anti-TTF-1 combined with any three probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3, and with one or more additional probes
    -   anti-mucin 1 and anti-TTF-1 combined with any four probes selected from the group consisting of anti-VEGF, anti-surfactant apoprotein A, anti-BCL2, anti-ER-related P29 and anti-Glut 3, and with one or more additional probes
    -   anti-VEGF, anti-surfactant apoprotein A, anti-mucin 1, anti-TTF-1, anti-BCL2, anti-ER-related P29, anti-Glut 3 and one or more additional probes

Discrimination Between Squamous Cell Carcinoma and Other Lung Cancers

-   -   anti-CD44v6 combined with one or more additional probes
    -   anti-CD44v6 combined with any one probe selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3
    -   anti-CD44v6 combined with any two probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3
    -   anti-CD44v6 combined with any three probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3
    -   anti-CD44v6 combined with any four probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3
    -   anti-CD44v6, anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3
    -   anti-CD44v6 combined with any one probe selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3, and with one or more additional probes
    -   anti-CD44v6 combined with any two probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3, and with one or more additional probes
    -   anti-CD44v6 combined with any three probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3, and with one or more additional probes
    -   anti-CD44v6 combined with any four probes selected from the group consisting of anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29 and anti-melanoma-associated antigen 3, and with one or more additional probes
    -   anti-CD44v6, anti-VEGF, anti-thrombomodulin, anti-Glut 1, anti-ER-related P29, anti-melanoma-associated antigen 3 and one or more additional probes

Discrimination Between Large Cell Carcinoma and Other Lung Cancers

-   -   anti-VEGF combined with one or more additional probes
    -   anti-VEGF and anti-p120
    -   anti-VEGF and anti-Glut 3
    -   anti-VEGF, anti-p120 and anti-Cyclin A
    -   anti-VEGF, anti-p120 and one or more additional probes
    -   anti-VEGF, anti-Glut 3 and one or more additional probes
    -   anti-VEGF, anti-p120, anti-Cyclin A and one or more additional probes

Discrimination Between Mesothelioma and Other Lung Cancers

-   -   anti-CD44v6 combined with one or more additional probes
    -   anti-proliferating cell nuclear antigen combined with one or more additional probes
    -   anti-human epithelial related antigen (MOC-31) combined with one or more additional probes
    -   two probes selected from the group consisting of anti-CD44v6, anti-proliferating cell nuclear antigen and anti-human epithelial related antigen (MOC-31), combined with one or more additional probes
    -   anti-CD44v6, anti-proliferating cell nuclear antigen, anti-human epithelial related antigen (MOC-31) and one or more additional probes

Discrimination Between Small Cell and Other Lung Cancers

-   -   anti-proliferating cell nuclear antigen combined with one or more additional probes
    -   anti-BCL2 combined with one or more additional probes
    -   anti-EGFR combined with one or more additional probes
    -   two probes selected from the group consisting of anti-proliferating cell nuclear antigen, anti-BCL2 and anti-EGFR
    -   anti-proliferating cell nuclear antigen, anti-BCL2, anti-EGFR
    -   two probes selected from the group consisting of anti-proliferating cell nuclear antigen, anti-BCL2 and anti-EGFR, combined with one or more additional probes
    -   anti-proliferating cell nuclear antigen, anti-BCL2, anti-EGFR and one or more additional probes

Simultaneous Discrimination of Adenocarcinoma, Squamous Cell Carcinoma, Large Cell Carcinoma, Mesothelioma and Small Cell Carcinoma

-   -   two or more probes selected from anti-VEGF, anti-thrombomodulin, anti-CD44v6, anti-surfactant apoprotein A, anti-proliferating cell nuclear antigen, anti-mucin 1, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-N-cadherin, anti-EGFR and anti-proliferating cell nuclear antigen
    -   anti-VEGF, anti-thrombomodulin, anti-CD44v6, anti-surfactant apoprotein A, anti-proliferating cell nuclear antigen, anti-mucin 1, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-N-cadherin, anti-EGFR and anti-proliferating cell nuclear antigen
    -   two or more probes selected from anti-VEGF, anti-thrombomodulin, anti-CD44v6, anti-surfactant apoprotein A, anti-proliferating cell nuclear antigen, anti-mucin 1, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-N-cadherin, anti-EGFR and anti-proliferating cell nuclear antigen, combined with one or more additional probes
    -   anti-VEGF, anti-thrombomodulin, anti-CD44v6, anti-surfactant apoprotein A, anti-proliferating cell nuclear antigen, anti-mucin 1, anti-human epithelial related antigen (MOC-31), anti-TTF-1, anti-N-cadherin, anti-EGFR and anti-proliferating cell nuclear antigen, combined with one or more additional probes

5. Conclusions

a. Validity of Panel Approach to Molecular Diagnostics

i. Non-Intuitive Solutions

Histograms were plotted (PathologistData.xls, worksheet: Histograms) showing the distribution of marker scores for each probe for Control vs. Cancer. It is clear from these histograms that an intuitive selection of probes for specific panels is not obvious, and that the invention described allows effective combinations to be found in the absence of an obvious method.

ii. Optimization for Varied Product Applications

iii. Robustness of Approach Demonstrated by Similar Results Using Different Methods

Detailed scrutiny of the results obtained by the various analyses in the body of this report, and as summarized in the tables and figures, shows the following findings.

-   -   1. Careful scrutiny of the performance of individual probes does not make apparent probe combinations that might perform better than any one probe alone.
    -   2. All three classification methodologies evaluated home in on similar sets of features. The small differences can be attributed to the data structure that may favor one classifier over another.
    -   3. All the classifiers designed with one of these methods were shown to give good performance when tested on data from an independent pathologist, unseen during the design process. This gives high confidence in the invention.
    -   4. A detection panel based on probe 7 alone gives a high performance.
    -   5. If probe 7 is combined with probe 16 or 25, then a better performance is obtained.
    -   6. While combinations of other probes with probe 7 appear to improve performance further, the number of extra cases captured is so low that they may be unrepresentative and the classifier so designed may not generalize.
    -   7. The performance of panels selected from probes excluding probe 7 provided some discrimination, good enough in comparison with current practice using human screening, but perhaps not good enough for an automated cytometer in tomorrow's clinical diagnostic cytology world (see FIG. 6).
    -   8. Other combinations of probes can provide a useful, but lesser, performance.
    -   9. If some probes become unavailable, this invention allows the selection of other combinations of probes. This was illustrated by classifier designs based on a commercially preferred set of probes only. See FIG. 7.
    -   10. The invention allows a weighting to be applied against costly probes. Rather than totally excluding them from the analysis, this allows their inclusion in the panel if their contribution is important.
    -   11. The invention allows the design of single lung cancer type specific discrimination panels that can discriminate one type of lung cancer from among all other cancers.
    -   12. Analysis of the performance of a single panel to classify five cancers showed discrimination was possible, but the overall error rate was worse than that of a set of five panels each designed to discriminate one of the cancers from the others.
    -   13. A very useful discrimination was obtained with the combination of five two-way classifiers.
    -   14. Common sets of probes were selected by the three classification methodologies for the five discrimination panels, again giving confidence in this result.
    -   15. Probes for isolating cases of Adenocarcinoma are 1, 14, 19, 20, 25, and 27.
    -   16. Probes for isolating cases of Squamous Cell cancer are 1, 2, 3, 24, 25, and 26.
    -   17. Probes for isolating cases of Large Cell cancer are 1 and 7, or 1 and 21.
    -   18. Probes for isolating cases of Mesothelioma are 3, 12, and 16.
    -   19. Probes for isolating cases of Small Cell cancer are 12, 20, and 23.
    -   20. Probes for recognizing all cancers simultaneously are 1, 2, 3, 4, 12, 14, 19, 22, 23, and 28.
    -   21. An advantage of using the multiple pair-wise panels as defined by this invention is that doubtful cases may not score on any of the five panels, and confusing cases may show on two or more panels. Such anomalous reports would alert the cytologist that further analysis is indicated.

iv. Risk Management Study

All the tests applied in this study were statistical in nature. There is a risk that probes selected on the basis of small improvements in performance will show statistical variation when tested on new data. To give confidence in the results, the best classifier emerging from the Linear Discriminant analysis on the Pathologist 1 and Pathologist 2 data was tested. It should be remembered that the Pathologist 3 data was statistically different from the Pathologist 1 and Pathologist 2 data, so if good performance is obtained when testing with the Pathologist 3 data, this would be encouraging indeed.

(1) Report on Testing with Unseen Data—Detection Panel

(a) Method

In the section titled “Detection Panel(s) Composition” above, we showed that good classification is obtained with features 7 and 16. Using SPSS, all the Pathologist 3 data that reported H scores for both 7 and 16 were selected. Then, using Transform and Compute, the canonical discrimination function was generated as a new feature. The performance of this feature alone was then tested.

(b) Results

These are the results of testing, on Pathologist 3 data, the classifier designed on Pathologist 1 and Pathologist 2 data. The classifier was designed using the linear discriminant function on probes 7 and 16. The Canonical Pathologist 2 function was = 0.965*Probe7 − 0.298*Probe16.
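Generating the canonical function as a new feature amounts to a weighted combination of the two H scores. The sketch below uses the coefficients quoted above; the function and parameter names are hypothetical, and because the text does not state the cut-off used to assign a case to a group, only the score itself is computed.

```python
def canonical_score(probe7_h: float, probe16_h: float) -> float:
    """Two-probe canonical discriminant score from H scores (each 0 to 12)."""
    return 0.965 * probe7_h - 0.298 * probe16_h

# Example: a case with a high probe 7 H score and no probe 16 staining.
example = canonical_score(10, 0)
```

The positive coefficient on probe 7 and the negative coefficient on probe 16 mean the score rises with probe 7 staining and falls with probe 16 staining.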

Classification Results on Pathologist 3 Data Using Probes 7 and 16

| | | Diagnosis (UCLA) | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 20 | 1 | 21 |
| | | 1 | 6 | 41 | 47 |
| | % | 0 | 95.2 | 4.8 | 100.0 |
| | | 1 | 12.8 | 87.2 | 100.0 |
| Cross-validated | Count | 0 | 20 | 1 | 21 |
| | | 1 | 6 | 41 | 47 |
| | % | 0 | 95.2 | 4.8 | 100.0 |
| | | 1 | 12.8 | 87.2 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 89.7% of original grouped cases correctly classified.
-   c 89.7% of cross-validated grouped cases correctly classified.

This is better than classifying the Pathologist 3 data on probe 7 only, shown as follows:

Classification Results on Pathologist 3 Data Using Probe 7 Only

| | | Diagnosis (UCLA) | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 20 | 1 | 21 |
| | | 1 | 8 | 39 | 47 |
| | % | 0 | 95.2 | 4.8 | 100.0 |
| | | 1 | 17.0 | 83.0 | 100.0 |
| Cross-validated | Count | 0 | 20 | 1 | 21 |
| | | 1 | 8 | 39 | 47 |
| | % | 0 | 95.2 | 4.8 | 100.0 |
| | | 1 | 17.0 | 83.0 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 86.8% of original grouped cases correctly classified.
-   c 86.8% of cross-validated grouped cases correctly classified.
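The cross-validation described in footnote a, in which each case is classified by functions derived from all other cases, is a leave-one-out procedure. The sketch below illustrates it with a deliberately simple one-feature nearest-class-mean rule standing in for the SPSS discriminant function; the stand-in classifier, the function names and the toy data are all assumptions for illustration.

```python
from statistics import mean
from typing import List, Tuple

Case = Tuple[float, int]  # (discriminant score, true group 0 or 1)

def nearest_mean_predict(train: List[Case], score: float) -> int:
    """Assign the group whose mean training score is closest."""
    mean0 = mean(s for s, g in train if g == 0)
    mean1 = mean(s for s, g in train if g == 1)
    return 0 if abs(score - mean0) <= abs(score - mean1) else 1

def leave_one_out_accuracy(cases: List[Case]) -> float:
    """Classify each case using a rule derived from all the other cases.

    Each group needs at least two cases so that a mean can still be
    computed when one of its cases is held out.
    """
    correct = 0
    for i, (score, group) in enumerate(cases):
        train = cases[:i] + cases[i + 1:]  # every case except case i
        if nearest_mean_predict(train, score) == group:
            correct += 1
    return correct / len(cases)

# Toy data: well-separated scores for the two groups.
toy_cases = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
toy_accuracy = leave_one_out_accuracy(toy_cases)
```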

(c) Conclusion

This gives confidence that the two-probe classifier based on 7 and 16 is better than probe 7 alone.

(2) Report on Testing with Unseen Data—Discrimination Panel

(a) Background

Reported below is the performance of the classifier designed with Pathologist 1 and Pathologist 2 data using LDF and tested with the unseen Pathologist 3 data. The number of cases at the design stage was relatively small and the numbers in the test data are also small, so a good degree of variability can be expected between performance on the first and second

(b) Method

In SPSS, the canonical discrimination functions derived in the section titled “Pattern recognition” were built and tested on Pathologist 3 data for all five classes of cancer.

(c) Results

Mesothelioma LDF = probe3sc*0.385 − probe12s*0.317 + probe16s*1.006

Classification Results

| | | Meso = 1, others = 0 | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 38 | 2 | 40 |
| | | 1 | 1 | 7 | 8 |
| | % | 0 | 95.0 | 5.0 | 100.0 |
| | | 1 | 12.5 | 87.5 | 100.0 |
| Cross-validated | Count | 0 | 38 | 2 | 40 |
| | | 1 | 1 | 7 | 8 |
| | % | 0 | 95.0 | 5.0 | 100.0 |
| | | 1 | 12.5 | 87.5 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 93.8% of original grouped cases correctly classified.
-   c 93.8% of cross-validated grouped cases correctly classified.

Small cell cancer LDF = probe12s*0.575 − probe20s*0.408 − probe22s*0.423 + probe23s*0.344

Classification Results

| | | Small = 1, others = 0 | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 39 | 3 | 42 |
| | | 1 | 1 | 5 | 6 |
| | % | 0 | 92.9 | 7.1 | 100.0 |
| | | 1 | 16.7 | 83.3 | 100.0 |
| Cross-validated | Count | 0 | 39 | 3 | 42 |
| | | 1 | 1 | 5 | 6 |
| | % | 0 | 92.9 | 7.1 | 100.0 |
| | | 1 | 16.7 | 83.3 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 91.7% of original grouped cases correctly classified.
-   c 91.7% of cross-validated grouped cases correctly classified.

Squamous cell cancer LDF = −probe1sc*0.328 − probe2sc*0.295 + probe3sc*0.741 + probe24s*0.490 + probe25s*0.393 + probe26s*0.426

Classification Results

| | | Squamous = 1, others = 0 | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 31 | 4 | 35 |
| | | 1 | 2 | 9 | 11 |
| | % | 0 | 88.6 | 11.4 | 100.0 |
| | | 1 | 18.2 | 81.8 | 100.0 |
| Cross-validated | Count | 0 | 31 | 4 | 35 |
| | | 1 | 2 | 9 | 11 |
| | % | 0 | 88.6 | 11.4 | 100.0 |
| | | 1 | 18.2 | 81.8 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 87.0% of original grouped cases correctly classified.
-   c 87.0% of cross-validated grouped cases correctly classified.

Large cell cancer LDF = probe1sc*0.847 + probe7sc*0.452

Classification Results

| | | Large = 1, others = 0 | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 23 | 15 | 38 |
| | | 1 | 4 | 5 | 9 |
| | % | 0 | 60.5 | 39.5 | 100.0 |
| | | 1 | 44.4 | 55.6 | 100.0 |
| Cross-validated | Count | 0 | 23 | 15 | 38 |
| | | 1 | 4 | 5 | 9 |
| | % | 0 | 60.5 | 39.5 | 100.0 |
| | | 1 | 44.4 | 55.6 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 59.6% of original grouped cases correctly classified.
-   c 59.6% of cross-validated grouped cases correctly classified.

The lower, but useful, performance was on a classifier designed and tested with a very small number of cases of large cell cancer, so this result is still very encouraging.

Adenocarcinoma LDF = −probe4sc*0.515 − probe5sc*0.299 − probe14s*0.485 − probe19s*0.347 + probe20s*0.723 + probe25s*0.327 + probe27s*0.327

Classification Results

| | | Adeno = 1, others = 0 | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 29 | 5 | 34 |
| | | 1 | 0 | 14 | 14 |
| | % | 0 | 85.3 | 14.7 | 100.0 |
| | | 1 | .0 | 100.0 | 100.0 |
| Cross-validated | Count | 0 | 29 | 5 | 34 |
| | | 1 | 0 | 14 | 14 |
| | % | 0 | 85.3 | 14.7 | 100.0 |
| | | 1 | .0 | 100.0 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 89.6% of original grouped cases correctly classified.
-   c 89.6% of cross-validated grouped cases correctly classified.
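The “correctly classified” percentages in footnotes b and c follow directly from the counts in each classification table. The sketch below recomputes them for the five discrimination panels; the counts are transcribed from the tables above as (true negatives, false positives, false negatives, true positives), and the dictionary keys and function name are illustrative.

```python
from typing import Dict, Tuple

# (TN, FP, FN, TP) per panel, transcribed from the classification tables.
PANEL_COUNTS: Dict[str, Tuple[int, int, int, int]] = {
    "mesothelioma": (38, 2, 1, 7),
    "small_cell": (39, 3, 1, 5),
    "squamous": (31, 4, 2, 9),
    "large_cell": (23, 15, 4, 5),
    "adenocarcinoma": (29, 5, 0, 14),
}

def percent_correct(tn: int, fp: int, fn: int, tp: int) -> float:
    """Percentage of all cases placed in their true group."""
    return 100.0 * (tn + tp) / (tn + fp + fn + tp)

accuracy_by_panel = {name: percent_correct(*counts)
                     for name, counts in PANEL_COUNTS.items()}
```

Recomputing the figures this way is a quick consistency check between each table and its footnotes.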

(d) Conclusion

It is very encouraging to note that the performance of these classifiers stands up to the test of applying unseen data. This gives very high confidence in the ability to detect the individual cancers.

(3) Training and Testing on Data from Different Patients and Pathologists

As a “final final” test of robustness, an LDF was trained on the data that was reviewed by both Pathologist 1 and Pathologist 2. This removes data reviewed by Pathologist 3. Hence testing on data reviewed by both Pathologist 3 plus Pathologist 1 data is not biased. Previously the test process was biased through using data from the same patient for test and train.

LDF produced the same set of features except for probe 4, which was not included. The LDF was = probe1sc*0.288 + probe7sc*0.846 − probe15s*0.249 − probe16s*0.534

Classification Results

Area Under the Curve = 0.977

| | | Diagnosis (UCLA) | Predicted group 0 | Predicted group 1 | Total |
|---|---|---|---|---|---|
| Original | Count | 0 | 20 | 0 | 20 |
| | | 1 | 9 | 37 | 46 |
| | % | 0 | 100.0 | .0 | 100.0 |
| | | 1 | 19.6 | 80.4 | 100.0 |
| Cross-validated | Count | 0 | 20 | 0 | 20 |
| | | 1 | 9 | 37 | 46 |
| | % | 0 | 100.0 | .0 | 100.0 |
| | | 1 | 19.6 | 80.4 | 100.0 |

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 86.4% of original grouped cases correctly classified.
-   c 86.4% of cross-validated grouped cases correctly classified.

This is still a reasonable result. A similar result, but with a smaller area under the curve, was obtained with probe 7 alone on Pathologist 3 only data:

Classification Results

Area Under the Curve = 0.908

                                  Predicted Group Membership
                       Diagnosis
                       (UCLA)        0          1        Total
Original         Count   0          19          1          20
                         1           7         39          46
                 %       0        95.0        5.0       100.0
                         1        15.2       84.8       100.0
Cross-validated  Count   0          19          1          20
                         1           7         39          46
                 %       0        95.0        5.0       100.0
                         1        15.2       84.8       100.0

-   a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
-   b 87.9% of original grouped cases correctly classified.
-   c 87.9% of cross-validated grouped cases correctly classified.

REFERENCES

-   [1] Goldberg-Kahn, B., Healy, J. C. and Bishop, J. W. (1997) The    cost of diagnosis: a comparison of four different strategies in the    workup of solitary radiographic lung lesions. Chest 111, 870-6.-   [2] O'Donovan, P. B. (1997) The radiologic appearance of lung    cancer. Oncology (Huntingt) 11, 1387-402; discussion 1402-4.-   [3] Worrell, J. A. (1995) Radiology of the central airways.    Otolaryngol Clin North Am 28, 701-20.-   [4] Henschke, C. I., Miettinen, O. S., Yankelevitz, D. F.,    Libby, D. M. and Smith, J. P. (1994) Radiographic screening for    cancer. Proposed paradigm for requisite research. Clin Imaging 18,    16-20.-   [5] Lam, S., Kennedy, T., Unger, M., Miller, Y. E., Gelmont, D.,    Rusch, V., Gipe, B., Howard, D., LeRiche, J. C., Coldman, A. and    Gazdar, A. F. (1998) Localization of bronchial intraepithelial    neoplastic lesions by fluorescence bronchoscopy. Chest 113, 696-702.-   [6] Sazon, D. A., Santiago, S. M., Soo Hoo, G. W., Khonsary, A.,    Brown, C., Mandelkem, M., Blahd, W. and Williams, A. J. (1996)    Fluorodeoxyglucose-positron emission tomography in the detection and    staging of lung cancer. Am J Respir Crit Care Med 153, 417-21.-   [7] Lowe, V. J., DeLong, D. M., Hoffman, J. M. and    Coleman, R. E. (1995) Optimum scanning protocol for FDG-PET    evaluation of pulmonary malignancy. J Nucl Med 36, 883-7.-   [8] Lowe, V. J., Fletcher, J. W., Gobar, L., Lawson, M., Kirchner,    P., Valk, P., Karis, J., Hubner, K., Delbeke, D., Heiberg, E. V.,    Patz, E. F. and Coleman, R. E. (1998) Prospective investigation of    positron emission tomography in lung nodules. J Clin Oncol 16,    1075-84.-   [9] Raab, S. S., Hornberger, J. and Raffin, T. (1997) The importance    of sputum cytology in the diagnosis of lung cancer: a    cost-effectiveness analysis. Chest 112, 937-45.-   [10] Franklin, W. A. (1998) New molecular and cellular approaches to    lung cancer detection. In: Biology of Lung Cancer, pp. 529-570.-   [11] Kern, W. H. 
(1988) The diagnostic accuracy of sputum and urine    cytology. Acta Cytol 32, 651-4.-   [12] Mehta, A. C., Marty, J. J. and Lee, F. Y. (1993) Sputum    cytology. Clin Chest Med 14, 69-85.-   [13] Gledhill, A., Bates, C., Henderson, D., DaCosta, P. and    Thomas, G. (1997) Sputum cytology: a limited role. J Clin Pathol 50,    566-8.-   [14] Steffee, C. H., Segletes, L. A. and Geisinger, K. R. (1997)    Changing cytologic and histologic utilization patterns in the    diagnosis of 515 primary lung malignancies. Cancer 81, 105-15.-   [15] Zaman, M. B. (1991) Pulmonary cytology. Clin Lab Med 11,    293-315.-   [16] Flehinger, B. J. and Melamed, M. R. (1994) Current status of    screening for lung cancer. Chest Surg Clin N Am 4, 1-15.-   [17] Koss, L. G., Melamed, M. R. and Goodner, J. T. (1964) Pulmonary    cytology: A brief survey of diagnostic results from Jul. 1, 1952    until Dec. 31, 1960. Acta Cytol 8, 104.-   [18] Saccomanno, G., Saunders, R. P., Ellis, H., Archer, V. E.,    Wood, B. G. and Beckler, P. A. (1963) Concentration of carcinoma or    atypical cells in sputum. Acta Cytol 5, 305-310.-   [19] Miura, H., Konaka, C., Kawate, N., Tsuchida, T. and    Kato, H. (1992) Sputum cytology-positive, bronchoscopically negative    adenocarcinoma of the lung [see comments]. Chest 102, 1328-32.-   [20] Valatis, J., Warrens, D. and Gamble, D. (1981) Increased    incidence of adenocarcinoma of the lung. Cancer 47, 1042-1046.-   [21] Baldini, E. H. and Strauss, G. M. (1997) Women and lung cancer:    waiting to exhale. Chest 112, 229S-234S.-   [22] Caldwell, C. J. and Berry, C. L. (1996) Is the incidence of    primary adenocarcinoma of the lung increasing? Virchows Arch 429,    359-63.-   [23] Risse, E. K., Vooijs, G. P. and van't Hof, M. A. (1987)    Relationship between the cellular composition of sputum and the    cytologic diagnosis of lung cancer. Acta Cytol 31, 170-6.-   [24] Holiday, D. B., McLarty, J. W., Farley, M. L., Mabry, L. 
C.,    Cozens, D., Roby, T., Waldron, E., Underwood, R. D., Anderson, E.,    Culbreth, W. and et al. (1995) Sputum cytology within and across    laboratories. A reliability study. Acta Cytol 39, 195-206.-   [25] Eddy, D. M. (1989) Screening for lung cancer [see comments].    Ann Intern Med 111, 232-7.-   [26] Younes, M., Brown, R. W., Stephenson, M., Gondo, M. and    Cagle, P. T. (1997) Overexpression of Glut1 and Glut3 in stage I    nonsmall cell lung carcinoma is associated with poor survival.    Cancer 80, 1046-51.-   [27] Ogawa, J., Inoue, H. and Koide, S. (1997)    Glucose-transporter-type-1-gene amplification correlates with    sialyl-Lewis-X synthesis and proliferation in lung cancer. Int J    Cancer 74, 189-92.-   [28] Ito, T., Noguchi, Y., Satoh, S., Hayashi, H., Inayama, Y. and    Kitamura, H. (1998) Expression of facilitative glucose transporter    isoforms in lung carcinomas: its relation to histologic type,    differentiation grade, and tumor stage [see comments]. Mod Pathol    11, 437-43.-   [29] Sosolik, R. C., McGaughy, V. R. and De Young, B. R. (1997)    Anti-MOC-31: a potential addition to the pulmonary adenocarcinoma    versus mesothelioma immunohistochemistry panel. Mod Pathol 10,    716-9.-   [30] Ordonez, N. G. (1998) Value of the MOC-31 monoclonal antibody    in differentiating epithelial pleural mesothelioma from lung    adenocarcinoma. Hum Pathol 29, 166-9.-   [31] Takanami, I., Tanaka, F., Hashizume, T., Kikuchi, K., Yamamoto,    Y., Yamamoto, T. and Kodaira, S. (1996) The basic fibroblast growth    factor and its receptor in pulmonary adenocarcinomas: an    investigation of their expression as prognostic markers. Eur J    Cancer 32A, 1504-9.-   [32] Takanami, I., Imamura, T., Hashizume, T., Kikuchi, K.,    Yamamoto, Y., Yamamoto, T. and Kodaira, S. (0.1996)    Immunohistochemical detection of basic fibroblast growth factor as a    prognostic indicator in pulmonary adenocarcinoma. 
Jpn J Clin Oncol    26, 293-7.-   [33] Ohta, Y., Endo, Y., Tanaka, M., Shimizu, J., Oda, M., Hayashi,    Y., Watanabe, Y. and Sasaki, T. (1996) Significance of vascular    endothelial growth factor messenger RNA expression in primary lung    cancer. Clin Cancer Res 2, 1411-6 1996.-   [34] Volm, M., Koomagi, R., Mattern, J. and Stammler, G. (1997)    Prognostic value of basic fibroblast growth factor and its receptor    (FGFR-1) in patients with non-small cell lung carcinomas. Eur J    Cancer 33, 691-3.-   [35] Hiyama, K., Hiyama, E., Ishioka, S., Yamakido, M., Inai, K.,    Gazdar, A. F., Piatyszek, M. A. and Shay, J. W. (1995) Telomerase    activity in small-cell and non-small-cell lung cancers [see    comments]. J Natl Cancer Inst 87, 895-902.-   [36] Yashima, K., Litzky, L. A., Kaiser, L., Rogers, T., Lam, S.,    Wistuba, II, Milchgrub, S., Srivastava, S., Piatyszek, M. A.,    Shay, J. W. and Gazdar, A. F. (1997) Telomerase expression in    respiratory epithelium during the multistage pathogenesis of lung    carcinomas. Cancer Res 57, 2373-7.-   [37] Ahrendt, S. A., Yang, S. C., Wu, L., Westra, W. H., Jen, J.,    Califano, J. A. and Sidransky, D. (1997) Comparison of oncogene    mutation detection and telomerase activity for the molecular staging    of non-small cell lung cancer. Clin Cancer Res 3, 1207-14.-   [38] Albanell, J., Lonardo, F., Rusch, V., Engelhardt, M.,    Langenfeld, J., Han, W., Klimstra, D., Venkatraman, E., Moore, M. A.    and Dmitrovsky, E. (1997) High telomerase activity in primary lung    cancers: association with increased cell proliferation rates and    advanced pathologic stage. J Natl Cancer Inst 89, 1609-15.-   [39] Hiyama, K., Ishioka, S., Shay, J. W., Taooka, Y., Maeda, A.,    Isobe, T., Hiyama, E., Maeda, H. and Yamakido, M. (1998) Telomerase    activity as a novel marker of lung cancer. and immune-associated    lung diseases. Int J Mol Med 1, 545-9.-   [40] Yahata, N., Ohyashiki, K., Ohyashiki, J. 
H., Iwama, H.,    Hayashi, S., Ando, K., Hirano, T., Tsuchida, T., Kato, H.,    Shay, J. W. and Toyama, K. (1998) Telomerase activity in lung cancer    cells obtained from bronchial washings. J Natl Cancer Inst 90,    684-90.-   [41] Lee, J. C., Jong, H. S., Yoo, C. G., Han, S. K., Shim, Y. S.    and Kim, Y. W. (1998) Telomerase activity in lung cancer cell lines    and tissues. Lung Cancer 21, 99-103.-   [42] Arai, T., Yasuda, Y., Takaya, T., Ito, Y., Hayakawa, K.,    Toshima, S., Shibuya, C., Yoshimi, N. and Kashiki, Y. (1998)    Application of telomerase activity for screening of primary lung    cancer in broncho-alveolar lavage fluid. Oncol Rep 5, 405-8.-   [43] Fujii, M., Motoi, M., Saeki, H., Aoe, K. and    Moriwaki, S. (1993) Prognostic significance of proliferating cell    nuclear antigen (PCNA) expression in non-small cell lung cancer.    Acta Med Okayama 47, 103-8.-   [44] Kawai, T., Suzuki, M., Kono, S., Shinomiya, N., Rokutanda, M.,    Takagi, K., Ogata, T. and Tamai, S. (1994) Proliferating cell    nuclear antigen and Ki-67 in lung carcinoma. Correlation with DNA    flow cytometric analysis. Cancer 74, 2468-75.-   [45] Ogawa, J., Tsurumi, T., Yamada, S., Koide, S. and    Shohtsui A. (1994) Blood vessel invasion and expression of sialyl    Lewisx and proliferating cell nuclear antigen in stage I non-small    cell lung cancer. Relation to postoperative recurrence. Cancer 73,    1177-83.-   [46] Ebina, M., Steinberg, S. M., Mulshine, J. L. and    Linnoila, R. I. (1994) Relationship of p53 overexpression and    up-regulation of proliferating cell nuclear antigen with the    clinical course of non-small cell lung cancer. Cancer Res 54,    2496-503.-   [47] Fontanini, G., Vignati, S., Bigini, D., Merlo, G. R.,    Ribecchini, A., Angeletti, C. A., Basolo, F., Pingitore, R. and    Bevilacqua, G. (1994) Human non-small cell lung cancer: p53 protein    accumulation is an early event and persists during metastatic    progression [see comments]. 
J Pathol 174, 23-31.-   [48] Wiethege, T., Voss, B. and Muller, K. M. (1995) P53    accumulation and proliferating-cell nuclear antigen expression in    human lung cancer. J Cancer Res Clin Oncol 121, 371-7.-   [49] Esposito, V., Baldi, A., De Luca, A., Micheli, P., Mazzarella,    G., Baldi, F., Caputi, M. and Giordano, A. (1997) Prognostic value    of p53 in non-small cell lung cancer: relationship with    proliferating cell nuclear antigen and cigarette smoking. Hum Pathol    28, 233-7.-   [50] Caputi, M., Esposito, V., Groger, A. M., Pacilio, C., Murabito,    M., Dekan, G., Baldi, F., Wolner, E. and Giordano, A. (1998)    Prognostic role of proliferating cell nuclear antigen in lung    cancer: an immunohistochemical analysis. In Vivo 12, 85-8.-   [51] Hirata, T., Fukuse, T., Naiki, H., Hitomi, S. and    Wada, H. (1998) Expression of CD44 variant exon 6 in stage I    non-small cell lung carcinoma as a prognostic factor. Cancer Res 58,    1108-10.-   [52] Ariza, A., Mate, J. L., Isamat, M., Lopez, D., Von    Uexkull-Guldeband, C., Rosell, R., Femandez-Vasalo, A. and    Navas-Palacios, J. J. (1995) Standard and variant CD44 isoforms are    commonly expressed in lung cancer of the non-small cell type but not    of the small cell type. J Pathol 177, 363-8.-   [53] Fasano, M., Sabatini, M. T., Wieczorek, R., Sidhu, G.,    Goswami, S. and Jagirdar, J. (1997) CD44 and its v6 spliced variant    in lung tumors: a role in histogenesis? Cancer 80, 34-41.-   [54] Miyoshi, T., Kondo, K., Hino, N., Uyama, T. and    Monden, Y. (1997) The expression of the CD44 variant exon 6 is    associated with lymph node metastasis in non-small cell lung cancer.    Clin Cancer Res 3, 1289-97.-   [55] Tran, T. A., Kallakury, B. V., Sheehan, C. E. and    Ross, J. S. (1997) Expression of CD44 standard form and variant    isoforms in non-small cell lung carcinomas. Hum Pathol 28, 809-14.-   [56] Takigawa, N., Segawa, Y., Mandai, K., Takata, I. and    Fujimoto, N. 
(1997) Serum CD44 levels in patients with non-small    cell lung cancer and their relationship with clinicopathological    features. Lung Cancer 18, 147-57.-   [57] Kondo, K., Miyoshi, T., Hino, N., Shimizu, E., Masuda, N.,    Takada, M., Uyama, T. and Monden, Y. (1998) High frequency    expressions of CD44 standard and variant forms in non-small cell    lung cancers, but not in small cell lung cancers. J Surg Oncol 69,    128-36.-   [58] Sasaki, J. I., Tanabe, K. K., Takahashi, K., Okamoto, I.,    Fujimoto, H., Matsumoto, M., Suga, M., Ando, M. and Saya, H. (1998)    Expression of CD44 splicing isoforms in lung cancers: dominant    expression of CD44v8-10 in non-small cell lung carcinomas. Int J    Oncol 12, 525-33.-   [59] Volm, M., Koomagi, R., Mattern, J. and Stammler, G. (1997)    Cyclin A is associated with an unfavourable outcome in patients with    non-small-cell lung carcinomas. Br J Cancer 75, 1774-8.-   [60] Dobashi, Y., Shoji, M., Jiang, S. X., Kobayashi, M.,    Kawakubo, Y. and Kameya, T. (1998) Active cyclin A-CDK2 complex, a    possible critical factor for cell proliferation in human primary    lung carcinomas. Am J Pathol 153, 963-72.-   [61] Volm, M., Rittgen, W. and Drings, P. (1998) Prognostic value of    ERBB-1, VEGF, cyclin A, FOS, JUN and MYC in patients with squamous    cell lung carcinomas [published erratum appears in Br J Cancer 1998    April; 77(7):1198]. Br J Cancer 77, 663-9.-   [62] Shoji, M., Dobashi, Y., Morinaga, S., Jiang, S. X. and    Kameya, T. (1999) Tumor extension and cell proliferation in    adenocarcinomas of the lung. Am J Pathol 154, 909-18.-   [63] Shapiro, G. I., Edwards, C. D., Kobzik, L., Godleski, J.,    Richards, W., Sugarbaker, D. J. and Rollins, B. J. (1995) Reciprocal    Rb inactivation and p161NK4 expression in primary lung cancers and    cell lines. Cancer Res 55, 505-9.-   [64] Betticher, D. C., Heighway, J., Hasleton, P. S., Altermatt, H.    J., Ryder, W. D., Cerny, T. and Thatcher, N. 
(1996) Prognostic    significance of CCND1 (cyclin D1) overexpression in primary resected    non-small-cell lung cancer. Br J Cancer 73, 294-300.-   [65] Mate, J. L., Ariza, A., Aracil, C., Lopez, D., Isamat, M.,    Perez-Piteira, J. and Navas-Palacios, J. J. (1996) Cyclin D1    overexpression in non-small cell lung carcinoma: correlation with    Ki67 labelling index and poor cytoplasmic differentiation. J Pathol    180, 395-9.-   [66] Yang, W. I., Chung, K. Y., Shin, D. H. and Kim, Y. B. (1996)    Cyclin D1 protein expression in lung cancer. Yonsei Med J 37,    142-50.-   [67] Betticher, D. C., Heighway, J., Thatcher, N. and    Hasleton, P. S. (1997) Abnormal expression of CCND1 and RB 1 in    resection margin epithelia of lung cancer patients. Br J Cancer 75,    1761-8.-   [68] Nishio, M., Koshikawa, T., Yatabe, Y., Kuroishi, T., Suyama,    M., Nagatake, M., Sugiura, T., Ariyoshi, Y., Mitsudomi, T. and    Takahashi, T. (1997) Prognostic significance of cyclin D1 and    retinoblastoma expression in combination with p53 abnormalities in    primary, resected non-small cell lung cancers. Clin Cancer Res 3,    1051-8.-   [69] Caputi, M., De Luca, L., Papaccio, G., A, D. A., Cavallotti,    I., Scala, P., Scarano, F., Manna, M., Gualdiero, L. and De    Luca, B. (1997) Prognostic role of cyclin D1 in non small cell lung    cancer: an immunohistochemical analysis. Eur J Histochem 41, 133-8.-   [70] Betticher, D. C., White, G. R., Vonlanthen, S., Liu, X.,    Kappeler, A., Altermatt, H. J., Thatcher, N. and Heighway, J. (1997)    G1 control gene status is frequently altered in resectable non-small    cell lung cancer. Int J Cancer 74, 556-62.-   [71] Volm, M., Koomagi, R. and Rittgen, W. (1998) Clinical    implications of cyclins, cyclin-dependent kinases, RB and E2F1 in    squamous-cell lung carcinoma. Int J Cancer 79, 294-9.-   [72] Kurasono, Y., Ito, T., Kameda, Y., Nakamura, N. and    Kitamura, H. 
(1998) Expression of cyclin D1, retinoblastoma gene    protein, and p16 MTS1 protein in atypical adenomatous hyperplasia    and adenocarcinoma of the lung. An immunohistochemical analysis.    Virchows Arch 432, 207-15.-   [73] Tanaka, H., Fujii, Y., Hirabayashi, H., Miyoshi, S., Sakaguchi,    M., Yoon, H. E. and Matsuda, H. (1998) Disruption of the RB pathway    and cell-proliferative activity in non-small-cell lung cancers. Int    J Cancer 79, 111-5.-   [74] Olivero, M., Rizzo, M., Madeddu, R., Casadio, C.,    Pennacchietti, S., Nicotra, M. R., Prat, M., Maggi, G., Arena, N.,    Natali, P. G., Comoglio, P. M. and Di Renzo, M. F. (1996)    Overexpression and activation of hepatocyte growth factor/scatter    factor in human non-small-cell lung carcinomas. Br J Cancer 74,    1862-8.-   [75] Harvey, P., Warn, A., Newman, P., Perry, L. J., Ball, R. Y. and    Warn, R. M. (1996) Immunoreactivity for hepatocyte growth    factor/scatter factor and its receptor, met, in human lung    carcinomas and malignant mesotheliomas. J Pathol 180, 389-94.-   [76] Takanami, I., Tanana, F., Hashizume, T., Kikuchi, K., Yamamoto,    Y., Yamamoto, T. and Kodaira, S. (1996) Hepatocyte growth factor and    c-Met/hepatocyte growth factor receptor in pulmonary    adenocarcinomas: an evaluation of their expression as prognostic    markers. Oncology 53, 392-7.-   [77] Siegfried, J. M., Weissfeld, L. A., Luketich, J. D., Weyant, R.    J., Gubish, C. T. and Landreneau, R. J. (1998) The clinical    significance of hepatocyte growth factor for non-small cell lung    cancer. Ann Thorac Surg 66, 1915-8.-   [78] Nguyen, P. L., Niehans, G. A., Cherwitz, D. L., Kim, Y. S. and    Ho, S. B. (1996) Membrane-bound (MUC1) and secretory (MUC2, MUC3,    and MUC4) mucin gene expression in human lung cancer. Tumour Biol    17, 176-92.-   [79] Yu, C. J., Yang, P. C., Shun, C. T., Lee, Y. C., Kuo, S. H. and    Luh, K. T. 
(1996) Overexpression of MUC5 genes is associated with    early post-operative metastasis in non-small-cell lung cancer. Int J    Cancer 69, 457-65.-   [80] Yu, C. J., Shun, C. T., Yang, P. C., Lee, Y. C., Shew, J. Y.,    Kuo, S. H. and Luh, K. T. (1997) Sialomucin expression is associated    with erbB-2 oncoprotein overexpression, early recurrence, and cancer    death in non-small-cell lung cancer [published erratum appears in Am    J Respir Crit Care Med 1997 August; 156(2 Pt 1):677-8]. Am J Respir    Crit Care Med 155, 1419-27.-   [81] Jarrard, J. A., Linnoila, R. I., Lee, H., Steinberg, S. M.,    Witschi, H. and Szabo, E. (1998) MUC1 is a novel marker for the type    II pneumocyte lineage during lung carcinogenesis. Cancer Res 58,    5582-9.-   [82] Ohgami, A., Tsuda, T., Osaki, T., Mitsudomi, T., Morimoto, Y.,    Higashi, T. and Yasumoto, K. (1999) MUC1 mucin mRNA expression in    stage I lung adenocarcinoma and its association with early    recurrence. Ann Thorac Surg 67, 810-4.-   [83] Bejarano, P. A., Baughman, R. P., Biddinger, P. W., Miller, M.    A., Fenoglio-Preiser, C., al-Kafaji, B., Di Lauro, R. and    Whitsett, J. A. (1996) Surfactant proteins and thyroid transcription    factor-1 in pulmonary and breast carcinomas. Mod Pathol 9, 445-52.-   [84] Harlamert, H. A., Mira, J., Bejarano, P. A., Baughman, R. P.,    Miller, M. A., Whitsett, J. A. and Yassin, R. (1998) Thyroid    transcription factor-1 and cytokeratins 7 and 20 in pulmonary and    breast carcinoma. Acta Cytol 42, 1382-8.-   [85] Fontanini, G., Vignati, S., Lucchi, M., Mussi, A., Calcinai,    A., Boldrini, L., Chine, S., Silvestri, V., Angeletti, C. A.,    Basolo, F. and Bevilacqua, G. (1997) Neoangiogenesis and p53 protein    in lung cancer: their prognostic role and their relation with    vascular endothelial growth factor (VEGF) expression [see comments].    Br J Cancer 75, 1295-301.-   [86] Shibusa, T., Shijubo, N. and Abe, S. 
(1998) Tumor angiogenesis    and vascular endothelial growth factor expression in stage I lung    adenocarcinoma. Clin Cancer Res 4, 1483-7.-   [87] Giatromanolaki, A., Koukourakis, M. I., Kakolyris, S., Turley,    H., O'Byrne, K., Scott, P. A., Pezzella, F., Georgoulias, V.,    Harris, A. L. and Gatter, K. C. (1998) Vascular endothelial. growth    factor, wild-type p53, and angiogenesis in early operable non-small    cell lung cancer. Clin Cancer Res 4, 3017-24.-   [88] Fontanini, G., Boldrini, L., Vignati, S., Chine, S., Basolo,    F., Silvestri, V., Lucchi, M., Mussi, A., Angeletti, C. A. and    Bevilacqua, G. (1998) Bcl2 and p53 regulate vascular endothelial    growth factor (VEGF)— mediated angiogenesis in non-small cell lung    carcinoma. Eur J Cancer 34, 718-23.-   [89] Takahama, M., Tsutsumi, M., Tsujiuchi, T., Kido, A., Okajima,    E., Nezu, K., Tojo, T., Kushibe, K., Kitamura, S. and    Konishi, Y. (1998) Frequent expression of the vascular endothelial    growth factor in human non-small-cell lung cancers. Jpn J Clin Oncol    28, 176-81.-   [90] Sozzi, G., Miozzo, M., Tagliabue, E., Calderone, C., Lombardi,    L., Pilotti, S., Pastorino, U., Pierotti, M. A. and Della    Porta, G. (1991) Cytogenetic abnormalities and overexpression of    receptors for growth factors in normal bronchial epithelium and    tumor samples of lung cancer patients. Cancer Res 51, 400-4.-   [91] Volm, M., Efferth, T., Mattern, J. and Wodrich, W. (1992)    Overexpression of c-fos and c-erbB1 encoded proteins in squamous    cell carcinomas of the lung of smokers. Int J Oncol 1, 69-71 1992.-   [92] Wodrich, W. and Volm, M. (1993) Overexpression of oncoproteins    in non-small cell lung carcinomas of smokers. Carcinogenesis 14,    1121-4.-   [93] Pastorino, U., Sozzi, G., Miozzo, M., Tagliabue, E.,    Pilotti, S. and Pierotti, M. A. (1993) Genetic changes in lung    cancer. J Cell Biochem Suppl 17F, 237-48.-   [94] Gorgoulis, V., Sfikakis, P. 
P., Karameris, A., Papastamatiou,    H., Trigidou, R., Veslemes, M., Spandidos, D. A., Sfikakis, P. and    Jordanoglou, J. (1995) Molecular and immunohistochemical study of    class I growth factor receptors in squamous cell lung carcinomas.    Pathol Res Pract 191, 973-81.-   [95] Rusch, V., Klimstra, D., Linkov, I. and Dmitrovsky, E. (1995)    Aberrant expression of p53 or the epidermal growth factor receptor    is frequent in early bronchial neoplasia and coexpression precedes    squamous cell carcinoma development. Cancer Res 55, 1365-72.-   [96] Rusch, V. W. and Dmitrovsky, E. (1995) Molecular biologic    features of non-small cell lung cancer. Clinical implications. Chest    Surg Clin N Am 5, 39-55.-   [97] Fontanini, G., Vignati, S., Bigini, D., Mussi, A., Lucchi, H.,    Angeletti, C. A., Pingitore, R., Pepe, S., Basolo, F. and    Bevilacqua, G. (1995) Epidermal growth factor receptor (EGFr)    expression in non-small cell lung carcinomas correlates with    metastatic involvement of hilar and mediastinal lymph nodes in the    squamous subtype. Eur J Cancer 31A, 178-83.-   [98] Pflug, B. and Djakiew, D. (1996) Expression of the low affinity    nerve growth factor receptor in prostate epithelial cells negatively    regulates nerve growth factor-mediated growth via induction of    apoptosis (Meeting abstract). Proc Annu Meet Am Assoc Cancer Res 37,    A262 1996.-   [99] Rusch, V., Klimstra, D. Venkatraman, E., Langenfeld, J.,    Pisters, P. and Dmitrovsky, E. (1996) Overexpression of EGFR and    TGF-alpha is frequent in early stage non-small cell lung cancer, but    does not predict tumor progression (Meeting abstract). Proc Annu    Meet Am Assoc Cancer Res 37, A1314 1996.-   [100] Fujino, S., Enokibori, T., Tezuka, N., Asada, Y., Inoue, S.,    Kato, H. and Mori, A. (1996) A comparison of epidermal growth factor    receptor levels and other prognostic parameters in non-small cell    lung cancer. 
Eur J Cancer 32A, 2070-4.-   [101] Pastorino, U., Andreola, S., Tagliabue, E., Pezzella, F.,    Incarbone, M., Sozzi, G., Buyse, M., Menard, S., Pierotti, M. and    Rilke, F. (1997) Immunocytochemical markers in stage I lung cancer:    relevance to prognosis. J Clin Oncol 15, 2858-65.-   [102] Sekine, I., Takami, S., Guang, S. G., Yokose, T., Kodama, T.,    Nishiwaki, Y., Kinoshita, M., Matsumoto, H., Ogura, T. and    Nagai, K. (1998) Role of epidermal growth factor receptor    overexpression, K-ras point mutation and c-myc amplification in the    carcinogenesis of non-small cell lung cancer. Oncol Rep 5, 351-4.-   [103] Pfeiffer, P., Nexo, E., Bentzen, S. M., Clausen, P. P.,    Andersen, K., Rose, C. and Nex, E. (1998) Enzyme-linked    immunosorbent assay of epidermal growth factor receptor in lung    cancer: comparisons with immunohistochemistry, clinicopathological    features and prognosis. Br J Cancer 78, 96-9.-   [104] D'Amico, T. A., Massey, M., Hemdon, J. E., 2nd, Moore, M. B.    and Harpole, D. H., Jr. (1999) A biologic risk model for stage I    lung cancer: immunohistochemical analysis of 408 patients with the    use of ten molecular markers. J Thorac Cardiovasc Surg 117, 736-43.-   [105] Engel, M., Theisinger, B., Seib, T., Seitz, G., Huwer, H.,    Zang, K. D., Welter, C. and Dooley, S. (1993) High levels of nm23-H1    and nm23-H2 messenger RNA in human squamous-cell lung carcinoma are    associated with poor differentiation and advanced tumor stages. Int    J Cancer 55, 375-9.-   [106] Ozeki, Y., Takishima, K. and Mamiya, G. (1994)    Immunohistochemical analysis of nm23/NDP kinase expression in human    lung adenocarcinoma: association with tumor progression in Clara    cell type. Jpn J Cancer Res 85, 840-6.-   [107] Lai, W. W., Wu, M. H., Yan, J. J. and Chen, F. F. (1996)    Immunohistochemical analysis of nm23-H 1 in stage I non-small cell    lung cancer: a useful marker in prediction of metastases. 
Ann Thorac    Surg 62, 1500-4.-   [108] Gazzeri, S., Brambilla, E., Negoescu, A., Thoraval, D., Veron,    M., Moro, D. and Brambilla, C. (1996) Overexpression of nucleoside    diphosphate/kinase A/nm23-H1 protein in human lung tumors:    association with tumor progression in squamous carcinoma. Lab Invest    74, 158-67.-   [109] MacKinnon, M., Kerr, K. M., King, G., Kennedy, M. M.,    Cockburn, J. S. and Jeffrey, R. R. (1997) p53, c-erbB-2 and nm23    expression have no prognostic significance in primary pulmonary    adenocarcinoma. Eur J Cardiothorac Surg 11, 838-42.-   [110] Bosnar, M. H., Pavelic, K., Krizanac, S., Slobodnjak, Z. and    Pavelic, J. (1997) Squamous cell lung carcinomas: the role of    nm23-H1 gene. J Mol Med 75, 609-13.-   [111] Kawakubo, Y., Sato, Y., Koh, T., Kono, H. and    Kameya, T. (1997) Expression of nm23 protein in pulmonary    adenocarcinomas: inverse correlation to tumor progression. Lung    Cancer 17, 103-13.-   [112] Ritter, J. H., Dresler, C. M. and Wick, M. R. (1995)    Expression of bcl-2 protein in stage T1N0M0 non-small cell lung    carcinoma. Hum Pathol 26, 1227-32.-   [113] Kitagawa, Y., Wong, F., Lo, P., Elliott, M., Verburgt, L. M.,    Hogg, J. C. and Daya, M. (1996) Overexpression of Bcl-2 and    mutations in p53 and K-ras in resected human non-small cell lung    cancers. Am J Respir Cell Mol Biol 15, 45-54.-   [114] Rao, S. K., Krishna, M., Woda, B. A., Savas, L. and    Fraire, A. E. (1996) Immunohistochemical detection of bcl-2 protein    in adenocarcinoma and non-neoplastic cellular compartments of the    lung. Mod Pathol 9, 555-9.-   [115] Boers, J. E., ten Velde, G. P. and Thunnissen, F. B. (1996)    P53 in squamous metaplasia: a marker for risk of respiratory tract    carcinoma. Am J Respir Crit Care Med 153, 411-6.-   [116] Coppola, D., Clarke, M., Landreneau, R., Weyant, R. J.,    Cooper, D. and Yousem, S. A. (1996) Bcl-2, p53, CD44, and CD44v6    isoform expression in neuroendocrine tumors of the lung. 
Mod Pathol    9, 484-90.-   [117] Higashiyama, M., Doi, O., Kodama, K., Yokouchi, H. and    Tateishi, R. (1996) Bcl-2 oncoprotein expression is increased    especially in the portion of small cell carcinoma within the    combined type of small cell lung cancer. Tumour Biol 17, 341-4.-   [118] Strauss, G. M. (1997) Prognostic markers in resectable    non-small cell lung cancer. Hematol Oncol Clin North Am 11, 409-34.-   [119] Anton, R. C., Brown, R. W., Younes, M., Gondo, M. M.,    Stephenson, M. A. and Cagle, P. T. (1997) Absence of prognostic    significance of bcl-2 immunopositivity in non-small cell lung    cancer: analysis of 427 cases. Hum Pathol 28, 1079-82.-   [120] Ishida, H., Irie, K., Itoh, T., Furukawa, T. and    Tokunaga, O. (1997) The prognostic significance of p53 and bcl-2    expression in lung adenocarcinoma and its correlation with Ki-67    growth fraction. Cancer 80, 1034-45.-   [121] Stefanaki, K., Rontogiannis, D., Vamvouka, C., Bolioti, S.,    Chaniotis, V., Sotsiou, F., Vlychou, M., Delidis, G., Kakolyris, S.,    Georgoulias, V. and Kanavaros, P. (1998) Immunohistochemical    detection of bcl2, p53, mdm2 and p21/wafl proteins in small-cell    lung carcinomas. Anticancer Res 18, 1167-73.-   [122] Brambilla, E., Gazzeri, S., Lantuejoul, S., Coll, J. L., Moro,    D., Negoescu, A. and Brambilla, C. (1998) p53 mutant immunophenotype    and deregulation of p53 transcription pathway (Bcl2, Bax, and Wafl)    in precursor bronchial lesions of lung cancer. Clin Cancer Res 4,    1609-18.-   [123] Salgia, R. and Skarin, A. T. (1998) Molecular abnormalities in    lung cancer. J Clin Oncol 16, 1207-17.-   [124] Kim, Y. C., Park, K. O., Kern, J. A., Park, C. S., Lim, S. C.,    Jang, A. S. and Yang, J. B. (1998) The interactive effect of Ras,    HER2, P53 and Bcl-2 expression in predicting the survival of    non-small cell lung cancer patients. Lung Cancer 22, 181-90.-   [125] Groeger, A. 
M., Caputi, M., Esposito, V., De Luca, A., Salat, A., Murabito, M., Giordano, G. G., Baldi, F., Giordano, A. and Wolner, E. (1999) Bcl-2 protein expression correlates with nodal status in non small cell lung cancer. Anticancer Res 19, 821-4.
-   [126] Vargas, S. O., Leslie, K. O., Vacek, P. M., Socinski, M. A. and Weaver, D. L. (1998) Estrogen-receptor-related protein p29 in primary nonsmall cell lung carcinoma: pathologic and prognostic correlations. Cancer 82, 1495-500.
-   [127] Higashiyama, M., Doi, O., Kodama, K., Yokouchi, H. and Tateishi, R. (1994) Retinoblastoma protein expression in lung cancer: an immunohistochemical analysis. Oncology 51, 544-51.
-   [128] Xu, H. J., Quinlan, D. C., Davidson, A. G. et al. (1994) Altered retinoblastoma protein expression and prognosis in early stage non-small cell lung carcinoma. J. Natl. Cancer Inst. 86, 695-699.
-   [129] Lee, J. S., Kalapurakal, S., Ro, J. Y. and Hong, W. K. (1995) Prognostic significance of retinoblastoma protein expression in non-small cell lung cancer (Meeting abstract). Proc Annu Meet Am Assoc Cancer Res 36, A3787 1995.
-   [130] Dixon, G., Salisbury, J. and Walker, C. (1995) Expression of the retinoblastoma protein in normal and dysplastic bronchial epithelium and lung cancer (Meeting abstract). J Pathol 176, 32A 1995.
-   [131] Shapiro, G. I., Edwards, C. D., Kobzik, L., Godleski, J., Richards, W., Sugarbaker, D. J. and Rollins, B. J. (1995) Reciprocal Rb inactivation and p16 expression in lung cancer (Meeting abstract). Proc Annu Meet Am Assoc Cancer Res 36, A164 1995.
-   [132] Volm, M. and Stammler, G. (1996) Retinoblastoma (Rb) protein expression and resistance in squamous cell lung carcinomas. Anticancer Res 16, 891-4.
-   [133] Dosaka-Akita, H., Hu, S. X., Kinoshita, I., Fujino, M., Harada, M., Kawakami, Y. and Benedict, W. F. (1996) Prognostic significance of Rb protein expression in non-small cell lung cancer (NSCLC) (Meeting abstract). Proc Annu Meet Am Assoc Cancer Res 37, A1401 1996.
-   [134] Kinoshita, I., Dosaka-Akita, H., Akie, K., Mishina, T., Hiroumi, H. and Kawakami, Y. (1996) Significance of abnormal p16INK4 and RB protein expression in non-small cell lung cancer (NSCLC) (Meeting abstract). Proc Annu Meet Am Assoc Cancer Res 37, A3979 1996.
-   [135] Kratzke, R. A., Greatens, T. M., Rubins, J. B., Maddaus, M. A., Niewoehner, D. E., Niehans, G. A. and Geradts, J. (1996) Rb and p16INK4a expression in resected non-small cell lung tumors. Cancer Res 56, 3415-20.
-   [136] Sakaguchi, M., Fujii, Y., Hirabayashi, H., Yoon, H. E., Komoto, Y., Oue, T., Kusafuka, T., Okada, A. and Matsuda, H. (1996) Inversely correlated expression of p16 and Rb protein in non-small cell lung cancers: an immunohistochemical study. Int J Cancer 65, 442-5.
-   [137] Xu, H. J., Cagle, P. T., Hu, S. X., Li, J. and Benedict, W. F. (1996) Altered retinoblastoma and p53 protein status in non-small cell carcinoma of the lung: potential synergistic effects on prognosis. Clin. Cancer Res 2, 1169-76 1996.
-   [138] Dosaka-Akita, H., Hu, S. X., Fujino, M., Harada, M., Kinoshita, I., Xu, H. J., Kuzumaki, N., Kawakami, Y. and Benedict, W. F. (1997) Altered retinoblastoma protein expression in nonsmall cell lung cancer: its synergistic effects with altered ras and p53 protein status on prognosis. Cancer 79, 1329-37.
-   [139] Cagle, P. T., el-Naggar, A. K., Xu, H. J., Hu, S. X. and Benedict, W. F. (1997) Differential retinoblastoma protein expression in neuroendocrine tumors of the lung. Potential diagnostic implications. Am J Pathol 150, 393-400.
-   [140] Kashiwabara, K., Oyama, T., Sano, T., Fukuda, T. and Nakajima, T. (1998) Correlation between methylation status of the p16/CDKN2 gene and the expression of p16 and Rb proteins in primary non-small cell lung cancers. Int J Cancer 79, 215-20.
-   [141] Caputi, M., Esposito, V., Groger, A. M., De Luca, A., Pacilio, C., Dekan, G., Giordano, G. G., Baldi, F., Wolner, E. and Giordano, A. (1998) RB growth control evasion in lung cancer. Anticancer Res 18, 2371-4.
-   [142] Tamura, A., Matsubara, O., Hirokawa, K. and Aoki, N. (1993) Detection of thrombomodulin in human lung cancer cells. Am J Pathol 142, 79-85.
-   [143] Tamura, A., Komatsu, H., Hebisawa, A., Kurashima, A., Mori, M. and Katayama, T. (1996) Is thrombomodulin useful as a tumor marker of a lung cancer? Lung Cancer 15, 189-95.
-   [144] Collins, C. L., Ordonez, N. G., Schaefer, R., Cook, C. D., Xie, S. S., Granger, J., Hsu, P. L., Fink, L. and Hsu, S. M. (1992) Thrombomodulin expression in malignant pleural mesothelioma and pulmonary adenocarcinoma. Am J Pathol 141, 827-33.
-   [145] Hamatake, M., Ishida, T., Mitsudomi, T., Akazawa, K. and Sugimachi, K. (1996) Prognostic value and clinicopathological correlation of thrombomodulin in squamous cell carcinoma of the human lung. Clin Cancer Res 2, 763-6 1996.
-   [146] Ordonez, N. G. (1997) Value of thrombomodulin immunostaining in the diagnosis of mesothelioma. Histopathology 31, 25-30.
-   [147] Tolnay, E., Wiethege, T. and Muller, K. M. (1997) Expression and localization of thrombomodulin in preneoplastic bronchial lesions and in lung cancer. Virchows Arch 430, 209-12.
-   [148] Bohm, M., Totzeck, B. and Wieland, I. (1994) Differences of E-cadherin expression levels and patterns in human lung cancer. Ann Hematol 68, 81-3.
-   [149] Bohm, M., Totzeck, B., Birchmeier, W. and Wieland, I. (1994) Differences of E-cadherin expression levels and patterns in primary and metastatic human lung cancer. Clin Exp Metastasis 12, 55-62.
-   [150] Peralta Soler, A., Knudsen, K. A., Jaurand, M. C., Johnson, K. R., Wheelock, M. J., Klein-Szanto, A. J. and Salazar, H. (1995) The differential expression of N-cadherin and E-cadherin distinguishes pleural mesotheliomas from lung adenocarcinomas [see comments]. Hum Pathol 26, 1363-9.
-   [151] Han, A. C., Peralta-Soler, A., Knudsen, K. A., Wheelock, M. J., Johnson, K. R. and Salazar, H. (1997) Differential expression of N-cadherin in pleural mesotheliomas and E-cadherin in lung adenocarcinomas in formalin-fixed, paraffin-embedded tissues [see comments]. Hum Pathol 28, 641-5.
-   [152] Weynants, P., Lethe, B., Brasseur, F., Marchand, M. and Boon, T. (1994) Expression of mage genes by non-small-cell lung carcinomas. Int J Cancer 56, 826-9.
-   [153] Shichijo, S., Hayashi, A., Takamori, S., Tsunosue, R., Hoshino, T., Sakata, M., Kuramoto, T., Oizumi, K. and Itoh, K. (1995) Detection of MAGE-4 protein in lung cancers. Int J Cancer 64, 158-65.
-   [154] Sakata, M. (1996) Expression of MAGE gene family in lung cancers. Kurume Med J 43, 55-61.
-   [155] Fischer, C., Gudat, F., Stulz, P., Noppen, C., Schaefer, C., Zajac, P., Trutmann, M., Kocher, T., Zuber, M., Harder, F., Heberer, M. and Spagnoli, G. C. (1997) High expression of MAGE-3 protein in squamous-cell lung carcinoma [letter]. Int J Cancer 71, 1119-21.
-   [156] Gotoh, K., Yatabe, Y., Sugiura, T., Takagi, K., Ogawa, M., Takahashi, T. and Mitsudomi, T. (1998) Frequency of MAGE-3 gene expression in HLA-A2 positive patients with non-small cell lung cancer. Lung Cancer 20, 117-25.
-   [157] Uchiyama, B., Saijo, Y., Kumano, N., Abe, T., Fujimura, S., Ohkuda, K., Handa, M., Satoh, K. and Nukiwa, T. (1997) Expression of nucleolar protein p120 in human lung cancer: difference in histological types as a marker for proliferation. Clin Cancer Res 3, 1873-7.
-   [158] Singh, G., Scheithauer, B. W. and Katyal, S. L. (1986) The pathobiologic features of carcinomas of type II pneumocytes. An immunocytologic study. Cancer 57, 994-9.
-   [159] Mizutani, Y., Nakajima, T., Morinaga, S., Gotoh, M., Shimosato, Y., Akino, T. and Suzuki, A. (1988) Immunohistochemical localization of pulmonary surfactant apoproteins in various lung tumors. Special reference to nonmucus producing lung adenocarcinomas. Cancer 61, 532-7.
-   [160] Noguchi, M., Nakajima, T., Hirohashi, S., Akiba, T. and Shimosato, Y. (1989) Immunohistochemical distinction of malignant mesothelioma from pulmonary adenocarcinoma with anti-surfactant apoprotein, anti-Lewisa, and anti-Tn antibodies. Hum Pathol 20, 53-7.
-   [161] Linnoila, R. I., Jensen, S. M., Steinberg, S. M., Mulshine, J. L., Eggleston, J. C. and Gazdar, A. F. (1992) Peripheral airway cell marker expression in non-small cell lung carcinoma. Association with distinct clinicopathologic features [see comments]. Am J Clin Pathol 97, 233-43.
-   [162] Shijubo, N., Tsutahara, S., Hirasawa, M., Takahashi, H., Honda, Y., Suzuki, A., Kuroki, Y. and Akino, T. (1992) Pulmonary surfactant protein A in pleural effusions. Cancer 69, 2905-9.
-   [163] Shijubo, N., Honda, Y., Fujishima, T., Takahashi, H., Kodama, T., Kuroki, Y., Akino, T. and Abe, S. (1995) Lung surfactant protein-A and carcinoembryonic antigen in pleural effusions due to lung adenocarcinoma and malignant mesothelioma. Eur Respir J 8, 403-6.
-   [164] Nicholson, A. G., McCormick, C. J., Shimosato, Y., Butcher, D. N. and Sheppard, M. N. (1995) The value of PE-10, a monoclonal antibody against pulmonary surfactant, in distinguishing primary and metastatic lung tumours. Histopathology 27, 57-60.
-   [165] Khoor, A., Whitsett, J. A., Stahlman, M. T. and Halter, S. A. (1997) Expression of surfactant protein B precursor and surfactant protein B mRNA in adenocarcinoma of the lung. Mod Pathol 10, 62-7.
-   [166] Saitoh, H., Shimura, S., Fushimi, T., Okayama, H. and Shirato, K. (1997) Detection of surfactant protein-A gene transcript in the cells from pleural effusion for the diagnosis of lung adenocarcinoma. Am J Med 103, 400-4.
-   [167] Grohs et al., Acta Cytologica, 1996, 40(1):26-30.
-   [168] Grohs et al., Acta Cytologica, 1997, 41(1):144-152.

1.-34. (canceled)
35. A panel for discriminating adenocarcinoma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates adenocarcinoma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to Mucin 1 or a correlate marker thereof and a probe that binds to thyroid transcription factor 1 or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
36. The panel of claim 35, wherein the plurality of probes further comprises at least one probe that binds to a marker selected from the group consisting of VEGF, SP-A, BCL-2, ER-related (p29), Glut 3, and correlate markers thereof as depicted in the correlation matrix shown in FIG. 4.
37. A panel for discriminating squamous cell carcinoma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates squamous cell carcinoma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof and a probe that binds to ER-related (p29) or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
38. The panel of claim 37, wherein the plurality of probes further comprises at least one probe that binds to a marker selected from the group consisting of VEGF, thrombomodulin, Glut 1, MAGE 3, and correlate markers thereof as depicted in the correlation matrix shown in FIG. 4.
39. A panel for discriminating large cell carcinoma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates large cell carcinoma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to VEGF or a correlate marker thereof and a probe that binds to P120 or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
40. The panel of claim 39, wherein the plurality of probes further comprises a probe that binds to Cyclin A or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
41. A panel for discriminating mesothelioma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates mesothelioma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof, a probe that binds to PCNA or a correlate marker thereof and a probe that binds to HERA or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
42. A panel for discriminating mesothelioma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates mesothelioma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
43. A panel for discriminating mesothelioma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates mesothelioma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to PCNA or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
44. A panel for discriminating small cell carcinoma among other types of lung cancer, wherein the panel comprises a plurality of probes each of which specifically binds to a marker associated with a specific type of lung cancer, wherein: (a) the pattern of binding of the component probes of the panel to cells in a cytology sample discriminates small cell carcinoma among other types of lung cancer, and (b) the plurality of probes comprises a probe that binds to PCNA or a correlate marker thereof, a probe that binds to BCL-2 or a correlate marker thereof, and a probe that binds to EGFR or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
45. A cell-based method for discriminating adenocarcinoma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate adenocarcinoma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to Mucin 1 or a correlate marker thereof and a probe that binds to thyroid transcription factor 1 or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
46. The method of claim 45, wherein the plurality of probes further comprises at least one probe that binds to a marker selected from the group consisting of VEGF, SP-A, BCL-2, ER-related (p29), Glut 3, and correlate markers thereof as depicted in the correlation matrix shown in FIG. 4.
47. A cell-based method for discriminating squamous cell carcinoma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate squamous cell carcinoma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof and a probe that binds to ER-related (p29) or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
48. The method of claim 47, wherein said plurality of probes further comprises at least one probe that binds to a marker selected from the group consisting of VEGF, thrombomodulin, Glut 1, MAGE 3, and correlate markers thereof as depicted in the correlation matrix shown in FIG. 4.
49. A cell-based method for discriminating large cell carcinoma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate large cell carcinoma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to VEGF or a correlate marker thereof and a probe that binds to P120 or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
50. The method of claim 49, wherein the plurality of probes further comprises a probe that binds to Cyclin A or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
51. A cell-based method for discriminating mesothelioma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate mesothelioma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof, a probe that binds to PCNA or a correlate marker thereof, and a probe that binds to HERA or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.
52. A cell-based method for discriminating mesothelioma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate mesothelioma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to CD44v6 or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
53. A cell-based method for discriminating mesothelioma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate mesothelioma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to PCNA or a correlate marker thereof as depicted in the correlation matrix shown in FIG. 4.
54. A cell-based method for discriminating small cell carcinoma among other types of lung cancer comprising: (a) contacting cells from a cytology sample on one or more microscope slides with a panel comprising a plurality of probes; and (b) analyzing the pattern of binding of the component probes of the panel to cells in the cytology sample to discriminate small cell carcinoma among other types of lung cancer, wherein the plurality of probes comprises a probe that binds to PCNA or a correlate marker thereof, a probe that binds to BCL-2 or a correlate marker thereof, and a probe that binds to EGFR or a correlate marker thereof, wherein “correlate markers” are as depicted in the correlation matrix shown in FIG. 4.